Abstract



Learning theory has largely focused on two main learning scenarios. The first is the classical statistical
setting, where instances are drawn i.i.d. from a fixed distribution. The second is the completely adversarial
online learning scenario, where at every time step the adversary picks the worst instance to provide the
learner with. It can be argued that in the real world neither of these assumptions is reasonable. It is
therefore important to study problems with a range of assumptions on data. Unfortunately, theoretical
results in this area are scarce, possibly due to the absence of general tools for analysis. Focusing on the
regret formulation, we define the minimax value of a game where the adversary is restricted in his moves.
The framework captures stochastic and non-stochastic assumptions on data. Building on the sequential
symmetrization approach, we define a notion of distribution-dependent Rademacher complexity for the
spectrum of problems ranging from i.i.d. to worst-case. The bounds let us immediately deduce
variation-type bounds. We then consider the i.i.d. adversary and show equivalence of online and batch
learnability. In the supervised setting, we consider various hybrid assumptions on the way that x and
y variables are chosen. Finally, we consider smoothed learning problems and show that half-spaces are
online learnable in the smoothed model. In fact, exponentially small noise added to the adversary's decisions
turns this problem with infinite Littlestone's dimension into a learnable problem.

1 Introduction 

We continue the line of work on the minimax analysis of online learning, initiated in [1, 11, 12]. In these
papers, an array of tools has been developed to study the minimax value of diverse sequential problems
under the worst-case assumption on Nature. In [11], many analogues of the classical notions from statistical
learning theory have been developed, and these have been extended in [10] for performance measures well
beyond the additive regret. The process of sequential symmetrization emerged as a key technique for dealing
with complicated nested minimax expressions. In the worst-case model, the developed tools appear to give a
unified treatment to such sequential problems as regret minimization, calibration of forecasters, Blackwell's
approachability, Phi-regret, and more.

Learning theory has been so far focused predominantly on the i.i.d. and the worst-case learning scenarios. 
Much less is known about learnability in-between these two extremes. In the present paper, we make progress 
towards filling this gap. Instead of examining various performance measures, as in [10], we focus on external
regret and make assumptions on the behavior of Nature. By restricting Nature to play i.i.d. sequences, the
results boil down to the classical notions of statistical learning in the supervised learning scenario. By not
placing any restrictions on Nature, we recover the worst-case results of [11]. Between these two endpoints
of the spectrum, particular assumptions on the adversary yield interesting bounds on the minimax value of 
the associated problem. 

By inertia, we continue to use the name "online learning" to describe the sequential interaction between 
the player (learner) and Nature (adversary). We realize that the name can be misleading for a number of 
reasons. First, the techniques developed in [11, 10] apply far beyond the problems that would traditionally
be called "learning". Second, in this paper we deal with non-worst-case adversaries, while the word "online"
often (though not always) refers to worst-case. Still, we decided to keep the misnomer "online learning"
whenever the problem is sequential.

Adopting the game-theoretic language, we will think of the learner and the adversary as the two players
of a zero-sum repeated game. The adversary's moves will be associated with "data", while the moves of the
learner will be associated with a function or a parameter. This point of view is not new: game-theoretic minimax analysis
has been at the heart of statistical decision theory for more than half a century (see [3]). In fact, there is a 
well-developed theory of minimax estimation when restrictions are put on either the choice of the adversary 
or the allowed estimators by the player. We are not aware of a similar theory for sequential problems with 
non-i.i.d. data. 

In particular, minimax analysis is central to nonparametric estimation, where one aims to prove optimal 
rates of convergence of the proposed estimator. Lower bounds are proved by exhibiting a "bad enough" 
distribution of the data that can be chosen by the adversary. The form of the minimax value is often 

$$\inf_{\hat f} \sup_{f \in \mathcal{F}} \mathbb{E} \|\hat f - f\|^2 \tag{1}$$

where the infimum is over all estimators $\hat f$ and the supremum is over all functions $f$ from some class $\mathcal{F}$. It is
often assumed that $Y_t = f(X_t) + \epsilon_t$, with $\epsilon_t$ being zero-mean noise. An estimator can be thought of as a
strategy, mapping the data $\{(X_t, Y_t)\}_{t=1}^T$ to the space of functions on $\mathcal{X}$. This description is, of course, only
a rough sketch that does not capture the vast array of problems considered in nonparametric estimation. 

In statistical learning theory, the data are i.i.d. from an unknown distribution $P_{X \times Y}$ and the associated
minimax problem in the supervised setting with square loss is

$$\mathcal{V}_T^{\text{batch, sup}} = \inf_{\hat f} \sup_{P_{X \times Y}} \left\{ \mathbb{E}(Y - \hat f(X))^2 - \inf_{f \in \mathcal{F}} \mathbb{E}(Y - f(X))^2 \right\} \tag{2}$$

where the infimum is over all estimators (or learning algorithms) and the supremum is over all distributions.
Unlike nonparametric regression, which makes an assumption on the "regression function" $f \in \mathcal{F}$, statistical
learning theory often aims at distribution-free results. Because of this, the goal is more modest: to predict
as well as the best function in $\mathcal{F}$ rather than recover the true model. In particular, (2) sidesteps the issue of
approximation error (model misspecification).

What is known about the asymptotic behavior of (2)? The well-developed statistical learning theory tells
us that (2) converges to zero if and only if the combinatorial dimensions of $\mathcal{F}$ (that is, the VC dimension for
binary-valued, or scale-sensitive dimensions for real-valued functions) are finite. The convergence is intimately related
to the uniform Glivenko-Cantelli property. If indeed the value in (2) converges to zero, an algorithm that
achieves this is Empirical Risk Minimization (ERM). For unsupervised learning problems, however, ERM does not
necessarily drive the quantity $\mathbb{E}\hat f(X) - \inf_{f \in \mathcal{F}} \mathbb{E} f(X)$ to zero.

The formulation (2) no longer makes sense if the data generating process is non-stationary. Consider the
end of the spectrum opposite to i.i.d.: the data are chosen in a worst-case manner. First, consider an
oblivious adversary who fixes the individual sequence $x_1, \ldots, x_T$ ahead of the game and reveals it one by one.
A frequently studied notion of performance is regret, and the minimax value can be written as



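$$\mathcal{V}_T^{\text{oblivious}} = \inf_{\{f_t\}_{t=1}^T} \sup_{(x_1, \ldots, x_T)} \mathbb{E}_{f_1, \ldots, f_T} \left[ \frac{1}{T} \sum_{t=1}^T f_t(x_t) - \inf_{f \in \mathcal{F}} \frac{1}{T} \sum_{t=1}^T f(x_t) \right] \tag{3}$$

where the randomized strategy for round $t$ is $f_t : \mathcal{X}^{t-1} \to \mathcal{Q}$, with $\mathcal{Q}$ being the set of all distributions on $\mathcal{F}$,
and the expectation is over the player's randomization.
That is, the player furnishes his best randomized strategy for each round, and the adversary picks the worst
sequence.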
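
To make the regret objective concrete, here is a minimal Python sketch that evaluates the bracketed quantity in (3) for one realized sequence of plays. The finite class `toy_class`, the functions `f_a` and `f_b`, and the sequences are hypothetical stand-ins chosen for illustration; they are not taken from the paper.

```python
def external_regret(plays, xs, function_class):
    """Average regret of the played functions against the best fixed function.

    plays          -- callables f_t chosen by the player, one per round
    xs             -- adversary moves x_t, one per round
    function_class -- finite list of callables standing in for the class F
    """
    T = len(xs)
    player_loss = sum(f(x) for f, x in zip(plays, xs)) / T
    best_fixed = min(sum(f(x) for x in xs) / T for f in function_class)
    return player_loss - best_fixed


# Hypothetical toy class on X = {0, 1}: f_a(x) = x and f_b(x) = 1 - x.
def f_a(x):
    return float(x)


def f_b(x):
    return 1.0 - float(x)


toy_class = [f_a, f_b]
sequence = [0, 1, 1, 1]            # fixed ahead of time (oblivious adversary)
played = [f_a, f_a, f_a, f_b]      # realized draws of the player
regret = external_regret(played, sequence, toy_class)
```

The minimax value additionally takes an infimum over the player's strategies and a supremum over sequences; the sketch only evaluates the inner payoff.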






A non-oblivious (adaptive) adversary is, of course, more interesting. The protocol for the online interaction
is the following: on round $t$ the player chooses a distribution $q_t$ on $\mathcal{F}$, the adversary chooses the next move
$x_t \in \mathcal{X}$, the player draws $f_t$ from $q_t$, and the game proceeds to the next round. All the moves are observed
by both players. Instead of writing the value in terms of strategies, we can write it in an extended form as



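$$\mathcal{V}_T = \inf_{q_1 \in \mathcal{Q}} \sup_{x_1 \in \mathcal{X}} \mathbb{E}_{f_1 \sim q_1} \cdots \inf_{q_T \in \mathcal{Q}} \sup_{x_T \in \mathcal{X}} \mathbb{E}_{f_T \sim q_T} \left[ \sum_{t=1}^T f_t(x_t) - \inf_{f \in \mathcal{F}} \sum_{t=1}^T f(x_t) \right] \tag{4}$$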



This is precisely the quantity considered in [11]. The minimax value for notions other than regret has been
studied in [10]. In this paper, we are interested in restricting the ways in which the sequences $(x_1, \ldots, x_T)$
are produced. These restrictions can be imposed through a smaller set of mixed strategies that is available 
to the adversary at each round, or as a non-stochastic constraint at each round. The formulation we propose 
captures both types of assumptions. 

The main contribution of this paper is the development of tools for the analysis of online scenarios where 
the adversary's moves are restricted in various ways. Further, we consider a number of interesting scenarios 
(such as smoothed learning) which can be captured by our framework. The present paper only scratches 
the surface of what is possible with sequential minimax analysis. Many questions remain to be answered. For
instance, one can ask whether a certain adversary is more powerful than another adversary by studying the
value of the associated game.

The paper is organized as follows. In Section 2 we define the value of the game and appeal to minimax
duality. Distribution-dependent sequential Rademacher complexity is defined in Section 3 and can be seen
to generalize the classical notion as well as the worst-case notion from [11]. This section contains the main
symmetrization result, which relies on a careful consideration of original and tangent sequences. Section 4
is devoted to analysis of the distribution-dependent Rademacher complexity. In Section 5 we consider
non-stochastic constraints on the behavior of the adversary. From these results, variation-type results are
seamlessly deduced. Section 6 is devoted to the i.i.d. adversary. We show equivalence between batch and
online learnability. Hybrid adversarial-stochastic supervised learning is considered in Section 7. We show
that it is the way in which the x variable is chosen that governs the complexity of the problem, irrespective
of the way the y variable is picked. In Section 8 we introduce the notion of smoothed analysis in the online
learning scenario and show that a simple problem with infinite Littlestone's dimension becomes learnable
once a small amount of noise is added to the adversary's moves. Throughout the paper, we use the notation
introduced in [11, 10], and, in particular, we extensively use the "tree" notation.



2 Value of the Game 

Consider sets $\mathcal{F}$ and $\mathcal{X}$, where $\mathcal{F}$ is a closed subset of a complete separable metric space. Let $\mathcal{Q}$ be the set
of probability distributions on $\mathcal{F}$ and assume that $\mathcal{Q}$ is weakly compact. We consider randomized learners
who predict a distribution $q_t \in \mathcal{Q}$ on every round.

Let $\mathcal{P}$ be the set of probability distributions on $\mathcal{X}$. We would like to capture the fact that sequences
$(x_1, \ldots, x_T)$ cannot be arbitrary. This is achieved by defining restrictions on the adversary, that is, subsets
of "allowed" distributions for each round. These restrictions limit the scope of available mixed strategies for
the adversary.

Definition 1. A restriction $\mathcal{P}_{1:T}$ on the adversary is a sequence $\mathcal{P}_1, \ldots, \mathcal{P}_T$ of mappings $\mathcal{P}_t : \mathcal{X}^{t-1} \to 2^{\mathcal{P}}$
such that $\mathcal{P}_t(x_{1:t-1})$ is a convex subset of $\mathcal{P}$ for any $x_{1:t-1} \in \mathcal{X}^{t-1}$.

Note that the restrictions depend on the past moves of the adversary, but not on those of the player. We
will write $\mathcal{P}_t$ instead of $\mathcal{P}_t(x_{1:t-1})$ when $x_{1:t-1}$ is clearly defined.
Using the notion of restrictions, we can give names to several types of adversaries that we will study in this
paper.



• A worst-case adversary is defined by vacuous restrictions $\mathcal{P}_t(x_{1:t-1}) = \mathcal{P}$. That is, any mixed strategy
is available to the adversary, including any deterministic point distributions.

• A constrained adversary is defined by $\mathcal{P}_t(x_{1:t-1})$ being the set of all distributions supported on the set
$\{x \in \mathcal{X} : C_t(x_1, \ldots, x_{t-1}, x) = 1\}$ for some deterministic binary-valued constraint $C_t$. The deterministic
constraint can, for instance, ensure that the length of the path determined by the moves $x_1, \ldots, x_t$
stays below the allowed budget.

• A smoothed adversary picks the worst-case sequence, which gets corrupted by i.i.d. noise. Equivalently,
we can view this as restrictions on the adversary who chooses the "center" (or a parameter) of
the noise distribution. For a given family $\mathcal{G}$ of noise distributions (e.g. zero-mean Gaussian noise), the
restrictions are obtained by all possible shifts $\mathcal{P}_t = \{g(x - c_t) : g \in \mathcal{G}, c_t \in \mathcal{X}\}$.

• A hybrid adversary in the supervised learning game picks the worst-case label $y_t$, but is forced to draw
the $x_t$-variable from a fixed distribution [7].

• Finally, an i.i.d. adversary is defined by a time-invariant restriction $\mathcal{P}_t(x_{1:t-1}) = \{p\}$ for every $t$ and
some $p \in \mathcal{P}$.
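
The adversary types listed above can be sketched as small Python samplers that map the adversary's own history to the next move. This is only an illustration under simplifying assumptions: moves are real numbers, the constrained adversary is shown through its allowed support over a finite candidate set, and all names (`iid_adversary`, `smoothed_adversary`, `allowed_moves`, `play`, `step_constraint`) are hypothetical.

```python
import random


def iid_adversary(p_sampler):
    """Time-invariant restriction {p}: every move is drawn from p, ignoring history."""
    return lambda history, rng: p_sampler(rng)


def smoothed_adversary(choose_center, noise_sampler):
    """The adversary picks a center from the history; i.i.d. noise is then added."""
    return lambda history, rng: choose_center(history) + noise_sampler(rng)


def allowed_moves(candidates, constraint, history):
    """Support {x : C_t(x_1, ..., x_{t-1}, x) = 1} of a constrained adversary,
    restricted to a finite candidate set."""
    return [x for x in candidates if constraint(history, x) == 1]


def play(adversary, T, seed=0):
    """Generate x_1, ..., x_T; each move may depend on the adversary's own past."""
    rng = random.Random(seed)
    history = []
    for _ in range(T):
        history.append(adversary(history, rng))
    return history


def step_constraint(history, x):
    """Path-length style constraint: stay within distance 1 of the last move."""
    return 1 if not history or abs(x - history[-1]) <= 1 else 0


# Hypothetical examples on the real line.
uniform_iid = iid_adversary(lambda rng: rng.random())
drifting_smoothed = smoothed_adversary(
    lambda history: float(len(history)),   # worst-case "center" c_t
    lambda rng: rng.gauss(0.0, 0.01),      # zero-mean Gaussian noise
)

iid_moves = play(uniform_iid, 5)
smoothed_moves = play(drifting_smoothed, 5)
next_moves = allowed_moves(range(-3, 4), step_constraint, [0])
```

Note that none of the samplers looks at the player's moves, matching the remark that restrictions depend only on the adversary's past.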



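For the given restrictions $\mathcal{P}_{1:T}$, we define the value of the game as

$$\mathcal{V}_T(\mathcal{P}_{1:T}) = \inf_{q_1 \in \mathcal{Q}} \sup_{p_1 \in \mathcal{P}_1} \mathbb{E}_{f_1, x_1} \inf_{q_2 \in \mathcal{Q}} \sup_{p_2 \in \mathcal{P}_2} \mathbb{E}_{f_2, x_2} \cdots \inf_{q_T \in \mathcal{Q}} \sup_{p_T \in \mathcal{P}_T} \mathbb{E}_{f_T, x_T} \left[ \sum_{t=1}^T f_t(x_t) - \inf_{f \in \mathcal{F}} \sum_{t=1}^T f(x_t) \right] \tag{5}$$

where $f_t$ has distribution $q_t$ and $x_t$ has distribution $p_t$. As in [11], the adversary is adaptive, that is, chooses
$p_t$ based on the history of moves $f_{1:t-1}$ and $x_{1:t-1}$.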

At this point, the only difference from the setup of [11] is in the restrictions $\mathcal{P}_t$ on the adversary. Because
these restrictions might not allow point distributions, the suprema over $p_t$'s in (5) cannot be equivalently
written as the suprema over $x_t$'s.

The value of the game can also be written in terms of strategies $\pi = \{\pi_t\}_{t=1}^T$ and $\tau = \{\tau_t\}_{t=1}^T$ for the player
and the adversary, respectively, where $\pi_t : (\mathcal{F} \times \mathcal{X} \times \mathcal{P})^{t-1} \to \mathcal{Q}$ and $\tau_t : (\mathcal{F} \times \mathcal{X} \times \mathcal{Q})^{t-1} \to \mathcal{P}$. Crucially,
the strategies also depend on the mappings $\mathcal{P}_{1:T}$. The value of the game can equivalently be written in the
strategic form as



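$$\mathcal{V}_T(\mathcal{P}_{1:T}) = \inf_{\pi} \sup_{\tau} \mathbb{E}_{f_1 \sim \pi_1, x_1 \sim \tau_1} \cdots \mathbb{E}_{f_T \sim \pi_T, x_T \sim \tau_T} \left[ \sum_{t=1}^T f_t(x_t) - \inf_{f \in \mathcal{F}} \sum_{t=1}^T f(x_t) \right] \tag{6}$$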



A word about the notation. In [11], the value of the game is written as $\mathcal{V}_T(\mathcal{F})$, signifying that the main
object of study is $\mathcal{F}$. In [10], it is written as $\mathcal{V}_T(\ell, \Phi_T)$ since the focus is on the complexity of the set
of transformations $\Phi_T$ and the payoff mapping $\ell$. In the present paper, the main focus is indeed on the
restrictions on the adversary, justifying our choice $\mathcal{V}_T(\mathcal{P}_{1:T})$ for the notation.

The first step is to apply the minimax theorem. To this end, we verify the necessary conditions. Our
assumption that $\mathcal{F}$ is a closed subset of a complete separable metric space implies that $\mathcal{Q}$ is tight, and
Prokhorov's theorem states that compactness of $\mathcal{Q}$ under weak topology is equivalent to tightness [15].
Compactness under weak topology allows us to proceed as in [11]. Additionally, we require that the restriction
sets are compact and convex.
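Theorem 1. Let $\mathcal{F}$ and $\mathcal{X}$ be the sets of moves for the two players, satisfying the necessary conditions
for the minimax theorem to hold. Let $\mathcal{P}_{1:T}$ be the restrictions, and assume that for any $x_{1:t-1}$, $\mathcal{P}_t(x_{1:t-1})$
satisfies the necessary conditions for the minimax theorem to hold. Then

$$\mathcal{V}_T(\mathcal{P}_{1:T}) = \sup_{p_1 \in \mathcal{P}_1} \mathbb{E}_{x_1 \sim p_1} \cdots \sup_{p_T \in \mathcal{P}_T} \mathbb{E}_{x_T \sim p_T} \left[ \sum_{t=1}^T \inf_{f_t \in \mathcal{F}} \mathbb{E}_{x_t \sim p_t} [f_t(x_t)] - \inf_{f \in \mathcal{F}} \sum_{t=1}^T f(x_t) \right] \tag{7}$$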



The nested sequence of suprema and expected values in Theorem 1 can be re-written succinctly as

$$\mathcal{V}_T(\mathcal{P}_{1:T}) = \sup_{\mathbf{p} \in \mathfrak{P}} \mathbb{E}_{x_1 \sim p_1} \mathbb{E}_{x_2 \sim p_2(\cdot|x_1)} \cdots \mathbb{E}_{x_T \sim p_T(\cdot|x_{1:T-1})} \left[ \sum_{t=1}^T \inf_{f_t \in \mathcal{F}} \mathbb{E}_{x_t \sim p_t} [f_t(x_t)] - \inf_{f \in \mathcal{F}} \sum_{t=1}^T f(x_t) \right] \tag{8}$$

where the supremum is over all joint distributions $\mathbf{p}$ over sequences, such that $\mathbf{p}$ satisfies the restrictions
as described below. Given a joint distribution $\mathbf{p}$ on sequences $(x_1, \ldots, x_T) \in \mathcal{X}^T$, we denote the associated
conditional distributions by $p_t(\cdot|x_{1:t-1})$. We can think of the choice $\mathbf{p}$ as a sequence of oblivious strategies
$\{p_t : \mathcal{X}^{t-1} \to \mathcal{P}\}_{t=1}^T$, mapping the prefix $x_{1:t-1}$ to a conditional distribution $p_t(\cdot|x_{1:t-1}) \in \mathcal{P}_t(x_{1:t-1})$. We
will indeed call $\mathbf{p}$ a "joint distribution" or an "oblivious strategy" interchangeably. We say that a joint
distribution $\mathbf{p}$ satisfies restrictions if for any $t$ and any $x_{1:t-1} \in \mathcal{X}^{t-1}$, $p_t(\cdot|x_{1:t-1}) \in \mathcal{P}_t(x_{1:t-1})$. The set of
all joint distributions satisfying the restrictions is denoted by $\mathfrak{P}$. We note that Theorem 1 cannot be deduced
immediately from the analogous result in [11], as it is not clear how the restrictions on the adversary per
each round come into play after applying the minimax theorem. Nevertheless, it is comforting that the
restrictions directly translate into the set $\mathfrak{P}$ of oblivious strategies satisfying the restrictions.

Before continuing with our goal of upper-bounding the value of the game, let us answer the following question:
Is there an oblivious minimax strategy for the adversary? Even though Theorem 1 shows equality to some
quantity with a supremum over oblivious strategies $\mathbf{p}$, it is not immediate that the answer to our question
is affirmative, and a proof is required. To this end, for any oblivious strategy $\mathbf{p}$, define the regret the player
would get playing optimally against $\mathbf{p}$:



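$$\mathcal{V}_T^{\mathbf{p}} := \inf_{f_1 \in \mathcal{F}} \mathbb{E}_{x_1 \sim p_1} \inf_{f_2 \in \mathcal{F}} \mathbb{E}_{x_2 \sim p_2(\cdot|x_1)} \cdots \inf_{f_T \in \mathcal{F}} \mathbb{E}_{x_T \sim p_T(\cdot|x_{1:T-1})} \left[ \sum_{t=1}^T f_t(x_t) - \inf_{f \in \mathcal{F}} \sum_{t=1}^T f(x_t) \right] \tag{9}$$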



The next proposition shows that there is an oblivious minimax strategy for the adversary and a minimax 
optimal strategy for the player that does not depend on its own randomizations. The latter statement for 
worst-case learning is folklore, yet we have not seen a proof of it in the literature. 

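Proposition 2. For any oblivious strategy $\mathbf{p}$,

$$\mathcal{V}_T(\mathcal{P}_{1:T}) \ge \mathcal{V}_T^{\mathbf{p}} = \inf_{\pi} \mathbb{E} \left[ \sum_{t=1}^T \mathbb{E}_{f_t \sim \pi_t(\cdot|x_{1:t-1})} \mathbb{E}_{x_t \sim p_t} f_t(x_t) - \inf_{f \in \mathcal{F}} \sum_{t=1}^T f(x_t) \right] \tag{10}$$

with equality holding for $\mathbf{p}^*$ which achieves the supremum¹ in (8). Importantly, the infimum is over strategies
$\pi = \{\pi_t\}_{t=1}^T$ of the player that do not depend on the player's previous moves, that is, $\pi_t : \mathcal{X}^{t-1} \to \mathcal{Q}$. Hence,
there is an oblivious minimax optimal strategy for the adversary, and there is a corresponding minimax
optimal strategy for the player that does not depend on its own moves.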



Proposition 2 holds for all online learning settings with legal restrictions $\mathcal{P}_{1:T}$, encompassing also the
no-restrictions setting of worst-case online learning [11]. The result crucially relies on the fact that the objective
is external regret.

¹Here, and in the rest of the paper, if a supremum is not achieved, a slightly modified analysis can be carried out.

3 Symmetrization and Random Averages



Theorem 1 is a useful representation of the value of the game. As the next step, we upper bound it with
an expression which is easier to study. Such an expression is obtained by introducing Rademacher random
variables. This process can be termed sequential symmetrization and has been exploited in [1, 11, 10].
The restrictions $\mathcal{P}_t$, however, make sequential symmetrization a bit more involved than in the previous
papers. The main difficulty arises from the fact that the set $\mathcal{P}_t(x_{1:t-1})$ depends on the sequence $x_{1:t-1}$, and
symmetrization (that is, replacement of $x_s$ with $x'_s$) has to be done with care as it affects this dependence.
Roughly speaking, in the process of symmetrization, a tangent sequence $x'_1, x'_2, \ldots$ is introduced such that
$x_t$ and $x'_t$ are independent and identically distributed given "the past". However, "the past" is itself an
interleaving choice of the original sequence and the tangent sequence.

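Define the "selector function" $\chi : \mathcal{X} \times \mathcal{X} \times \{\pm 1\} \to \mathcal{X}$ by

$$\chi(x, x', \epsilon) = \begin{cases} x' & \text{if } \epsilon = 1 \\ x & \text{if } \epsilon = -1 \end{cases}$$

When $x_t$ and $x'_t$ are understood from the context, we will use the shorthand $\chi_t(\epsilon) := \chi(x_t, x'_t, \epsilon)$. In other
words, $\chi_t$ selects between $x_t$ and $x'_t$ depending on the sign of $\epsilon$.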

Throughout the paper, we deal with binary trees, which arise from symmetrization [11]. Given some set
$\mathcal{Z}$, a $\mathcal{Z}$-valued tree of depth $T$ is a sequence $(\mathbf{z}_1, \ldots, \mathbf{z}_T)$ of $T$ mappings $\mathbf{z}_t : \{\pm 1\}^{t-1} \to \mathcal{Z}$. The $T$-tuple
$\epsilon = (\epsilon_1, \ldots, \epsilon_T) \in \{\pm 1\}^T$ defines a path. For brevity, we write $\mathbf{z}_t(\epsilon)$ instead of $\mathbf{z}_t(\epsilon_{1:t-1})$.

Given a joint distribution $\mathbf{p}$, consider the $((\mathcal{X} \times \mathcal{X})^{T-1} \to \mathcal{P}(\mathcal{X} \times \mathcal{X}))$-valued probability tree $\boldsymbol{\rho} =
(\boldsymbol{\rho}_1, \ldots, \boldsymbol{\rho}_T)$ defined by

$$\boldsymbol{\rho}_t(\epsilon_{1:t-1}) \big( (x_1, x'_1), \ldots, (x_{T-1}, x'_{T-1}) \big) = \big( p_t(\cdot | \chi_1(\epsilon_1), \ldots, \chi_{t-1}(\epsilon_{t-1})),\ p_t(\cdot | \chi_1(\epsilon_1), \ldots, \chi_{t-1}(\epsilon_{t-1})) \big). \tag{11}$$

In other words, the values of the mappings $\boldsymbol{\rho}_t(\epsilon)$ are products of conditional distributions, where conditioning
is done with respect to a sequence made from $x_s$ and $x'_s$ depending on the sign of $\epsilon_s$. We note that
the difficulty in intermixing the $x$ and $x'$ sequences does not arise in i.i.d. or worst-case symmetrization.
However, in-between these extremes the notational complexity seems to be unavoidable if we are to employ
symmetrization and obtain a version of Rademacher complexity.

As an example, consider the "left-most" path $\epsilon = -\mathbf{1}$ in a binary tree of depth $T$, where $\mathbf{1} = (1, \ldots, 1)$
is a $T$-dimensional vector of ones. Then all the selectors $\chi(x_t, x'_t, \epsilon_t)$ in the definition (11) select the
sequence $x_1, \ldots, x_T$. The probability tree $\boldsymbol{\rho}$ on the "left-most" path is, therefore, defined by the conditional
distributions $p_t(\cdot|x_{1:t-1})$. Analogously, on the path $\epsilon = \mathbf{1}$, the conditional distributions are $p_t(\cdot|x'_{1:t-1})$.

Slightly abusing the notation, we will write $\boldsymbol{\rho}_t(\epsilon)\big((x_1, x'_1), \ldots, (x_{t-1}, x'_{t-1})\big)$ for the probability tree since $\boldsymbol{\rho}_t$
clearly depends only on the prefix up to time $t-1$. Throughout the paper, it will be understood that the tree
$\boldsymbol{\rho}$ is obtained from $\mathbf{p}$ as described above. Since all the conditional distributions of $\mathbf{p}$ satisfy the restrictions,
so do the corresponding distributions of the probability tree $\boldsymbol{\rho}$. By saying that $\boldsymbol{\rho}$ satisfies restrictions we
then mean that $\mathbf{p} \in \mathfrak{P}$.

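Sampling of a pair of $\mathcal{X}$-valued trees from $\boldsymbol{\rho}$, written as $(\mathbf{x}, \mathbf{x}') \sim \boldsymbol{\rho}$, is defined as the following recursive
process: for any $\epsilon \in \{\pm 1\}^T$,

$$(\mathbf{x}_1(\epsilon), \mathbf{x}'_1(\epsilon)) \sim \boldsymbol{\rho}_1(\epsilon)$$

$$(\mathbf{x}_t(\epsilon), \mathbf{x}'_t(\epsilon)) \sim \boldsymbol{\rho}_t(\epsilon)\big( (\mathbf{x}_1(\epsilon), \mathbf{x}'_1(\epsilon)), \ldots, (\mathbf{x}_{t-1}(\epsilon), \mathbf{x}'_{t-1}(\epsilon)) \big) \quad \text{for } 2 \le t \le T \tag{12}$$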



To gain a better understanding of the sampling process, consider the first few levels of the tree. The roots
$\mathbf{x}_1, \mathbf{x}'_1$ of the trees $\mathbf{x}, \mathbf{x}'$ are sampled from $p_1$, the conditional distribution for $t = 1$ given by $\mathbf{p}$. Next, say,
$\epsilon_1 = +1$. Then the "right" children of $\mathbf{x}_1$ and $\mathbf{x}'_1$ are sampled via $\mathbf{x}_2(+1), \mathbf{x}'_2(+1) \sim p_2(\cdot|\mathbf{x}'_1)$ since $\chi_1(+1)$
selects $\mathbf{x}'_1$. On the other hand, the "left" children $\mathbf{x}_2(-1), \mathbf{x}'_2(-1)$ are both distributed according to $p_2(\cdot|\mathbf{x}_1)$.
Now, suppose $\epsilon_1 = +1$ and $\epsilon_2 = -1$. Then, $\mathbf{x}_3(+1, -1), \mathbf{x}'_3(+1, -1)$ are both sampled from $p_3(\cdot|\mathbf{x}'_1, \mathbf{x}_2(+1))$.
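
The recursive sampling of the pair of trees can be sketched in Python as follows. This is an illustrative sketch, not code from the paper: trees are stored as dictionaries keyed by the sign prefix $(\epsilon_1, \ldots, \epsilon_{t-1})$, and `cond_sampler(history, rng)` is a hypothetical user-supplied function that draws one point from $p_t(\cdot \mid \text{history})$.

```python
import itertools
import random


def chi(x, x_prime, eps):
    """Selector: returns x' when eps = +1 and x when eps = -1."""
    return x_prime if eps == 1 else x


def sample_tree_pair(cond_sampler, T, rng):
    """Sample a pair (x, x') of depth-T trees by the recursive process.

    cond_sampler(history, rng) draws one point from p_t(. | history), where
    history is the list chi_1(eps_1), ..., chi_{t-1}(eps_{t-1}).
    Each tree is a dict keyed by the sign prefix (eps_1, ..., eps_{t-1}).
    """
    x, xp = {}, {}
    for t in range(1, T + 1):
        for prefix in itertools.product((-1, 1), repeat=t - 1):
            # Interleaved "past" along this prefix: level s+1 contributes
            # x or x' depending on the sign eps_{s+1} = prefix[s].
            history = [chi(x[prefix[:s]], xp[prefix[:s]], prefix[s])
                       for s in range(t - 1)]
            # Both coordinates are drawn independently from the same conditional.
            x[prefix] = cond_sampler(history, rng)
            xp[prefix] = cond_sampler(history, rng)
    return x, xp


def random_walk_sampler(history, rng):
    """Hypothetical conditional: previous selected point plus a +/-1 step."""
    last = history[-1] if history else 0
    return last + rng.choice((-1, 1))


tree_x, tree_xp = sample_tree_pair(random_walk_sampler, 3, random.Random(0))
```

Each tree has $2^T - 1$ nodes, so the full construction is only practical for small $T$; when a single path suffices, one can sample along that path alone.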

The proof of Theorem 3 reveals why such intricate conditional structure arises, and Section 4 shows that this
structure greatly simplifies for i.i.d. and worst-case situations. Nevertheless, the process described above
allows us to define a unified notion of Rademacher complexity for the spectrum of assumptions between the
two extremes.

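Definition 2. The distribution-dependent sequential Rademacher complexity of a function class $\mathcal{F} \subseteq \mathbb{R}^{\mathcal{X}}$ is
defined as

$$\mathfrak{R}_T(\mathcal{F}, \mathbf{p}) := \mathbb{E}_{(\mathbf{x}, \mathbf{x}') \sim \boldsymbol{\rho}}\, \mathbb{E}_{\epsilon} \left[ \sup_{f \in \mathcal{F}} \sum_{t=1}^T \epsilon_t f(\mathbf{x}_t(\epsilon)) \right]$$

where $\epsilon = (\epsilon_1, \ldots, \epsilon_T)$ is a sequence of i.i.d. Rademacher random variables and $\boldsymbol{\rho}$ is the probability tree
associated with $\mathbf{p}$.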
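
For a finite function class, the complexity in Definition 2 can be approximated by Monte Carlo. The sketch below rests on one observation: for a fixed path $\epsilon$ the supremum only involves the nodes along that path, so each replicate samples the signs and the pairs $(x_t, x'_t)$ along a single path rather than the whole tree. The function name, the finite class, and `cond_sampler` (which draws from $p_t(\cdot \mid \text{history})$) are hypothetical.

```python
import random


def estimate_seq_rademacher(cond_sampler, function_class, T, n_samples, seed=0):
    """Monte Carlo estimate of E_{(x,x')~rho} E_eps sup_f sum_t eps_t f(x_t(eps)).

    cond_sampler(history, rng) draws one point from p_t(. | history);
    function_class is a finite list of callables.
    """
    rng = random.Random(seed)
    total = 0.0
    for _ in range(n_samples):
        history, xs, signs = [], [], []
        for _t in range(T):
            x_t = cond_sampler(history, rng)
            x_t_prime = cond_sampler(history, rng)
            e_t = rng.choice((-1, 1))
            xs.append(x_t)
            signs.append(e_t)
            # chi_t(eps_t): the tangent point x'_t if eps_t = +1, else x_t.
            history.append(x_t_prime if e_t == 1 else x_t)
        total += max(sum(e * f(x) for e, x in zip(signs, xs))
                     for f in function_class)
    return total / n_samples


# Hypothetical example: i.i.d. uniform moves on {-1, +1} and the class {x, -x}.
def coin(history, rng):
    return rng.choice((-1, 1))


sign_class = [lambda x: float(x), lambda x: -float(x)]
estimate = estimate_seq_rademacher(coin, sign_class, T=4, n_samples=20000)
```

In this toy example the supremum equals $|\sum_t \epsilon_t x_t|$, the absolute position of a simple symmetric walk, so the estimate should be close to its mean.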

We now prove an upper bound on the value $\mathcal{V}_T(\mathcal{P}_{1:T})$ of the game in terms of this distribution-dependent
sequential Rademacher complexity. This provides an extension of the analogous result in [11] to adversaries
more benign than worst-case.



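Theorem 3. The minimax value is bounded as

$$\mathcal{V}_T(\mathcal{P}_{1:T}) \le 2 \sup_{\mathbf{p} \in \mathfrak{P}} \mathfrak{R}_T(\mathcal{F}, \mathbf{p}). \tag{13}$$

A more general statement also holds:

$$\mathcal{V}_T(\mathcal{P}_{1:T}) \le \sup_{\mathbf{p} \in \mathfrak{P}} \mathbb{E} \left[ \sup_{f \in \mathcal{F}} \sum_{t=1}^T f(x'_t) - f(x_t) \right] \le 2 \sup_{\mathbf{p} \in \mathfrak{P}} \mathbb{E}_{(\mathbf{x}, \mathbf{x}') \sim \boldsymbol{\rho}}\, \mathbb{E}_{\epsilon} \left[ \sup_{f \in \mathcal{F}} \sum_{t=1}^T \epsilon_t \big( f(\mathbf{x}_t(\epsilon)) - M_t(\mathbf{p}, f, \mathbf{x}, \mathbf{x}', \epsilon) \big) \right]$$

for any measurable function $M_t$ with the property $M_t(\mathbf{p}, f, \mathbf{x}, \mathbf{x}', \epsilon) = M_t(\mathbf{p}, f, \mathbf{x}', \mathbf{x}, -\epsilon)$. In particular, (13)
is obtained by choosing $M_t = 0$.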



The following corollary provides a natural "centered" version of the distribution-dependent Rademacher 
complexity. That is, the complexity can be measured by relative shifts in the adversarial moves. 



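Corollary 4. For the game with restrictions $\mathcal{P}_{1:T}$,

$$\mathcal{V}_T(\mathcal{P}_{1:T}) \le 2 \sup_{\mathbf{p} \in \mathfrak{P}} \mathbb{E}_{(\mathbf{x}, \mathbf{x}') \sim \boldsymbol{\rho}}\, \mathbb{E}_{\epsilon} \left[ \sup_{f \in \mathcal{F}} \sum_{t=1}^T \epsilon_t \big( f(\mathbf{x}_t(\epsilon)) - \mathbb{E}_{t-1} f(\mathbf{x}_t(\epsilon)) \big) \right]$$

where $\mathbb{E}_{t-1}$ denotes the conditional expectation of $\mathbf{x}_t(\epsilon)$.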

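Example 1. Suppose $\mathcal{F}$ is a unit ball in a Banach space and $f(x) = \langle f, x \rangle$. Then

$$\mathcal{V}_T(\mathcal{P}_{1:T}) \le 2 \sup_{\mathbf{p} \in \mathfrak{P}} \mathbb{E}_{(\mathbf{x}, \mathbf{x}') \sim \boldsymbol{\rho}}\, \mathbb{E}_{\epsilon} \left\| \sum_{t=1}^T \epsilon_t \big( \mathbf{x}_t(\epsilon) - \mathbb{E}_{t-1} \mathbf{x}_t(\epsilon) \big) \right\|$$

Suppose the adversary plays a simple random walk (e.g., $p_t(x|x_1, \ldots, x_{t-1}) = p_t(x|x_{t-1})$ is uniform on a
unit sphere). For simplicity, suppose this is the only strategy allowed by the set $\mathfrak{P}$. Then $\mathbf{x}_t(\epsilon) - \mathbb{E}_{t-1}\mathbf{x}_t(\epsilon)$
are independent increments when conditioned on the history. Further, the increments do not depend on $\epsilon_t$.
Thus,

$$\mathcal{V}_T(\mathcal{P}_{1:T}) \le 2\, \mathbb{E}\, [\, \cdots \,]$$

where $\{Y_t\}$ is the corresponding random walk.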



4 Analyzing Rademacher Complexity 



The aim of this section is to provide a better understanding of the distribution-dependent sequential
Rademacher complexity, as well as ways of upper-bounding it. We first show that the classical Rademacher
complexity is equal to the distribution-dependent sequential Rademacher complexity for i.i.d. data. We
further show that the distribution-dependent sequential Rademacher complexity is always upper bounded
by the worst-case sequential Rademacher complexity defined in [11].

It is already apparent to the reader that the sequential nature of the minimax formulation yields long
mathematical expressions, which are unwieldy though not necessarily complicated. The functional notation and
the tree notation alleviate much of this difficulty. However, it takes some time to become familiar and
comfortable with these representations. The next few results hopefully provide the reader with a better feel
for the distribution-dependent sequential Rademacher complexity.

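Proposition 5. Consider the i.i.d. restrictions $\mathcal{P}_t = \{p\}$ for all $t$, where $p$ is some fixed distribution on $\mathcal{X}$.
Let $\boldsymbol{\rho}$ be the process associated with the joint distribution $\mathbf{p} = p^T$. Then

$$\mathfrak{R}_T(\mathcal{F}, \mathbf{p}) = \mathcal{R}_T(\mathcal{F}, p)$$

where

$$\mathcal{R}_T(\mathcal{F}, p) := \mathbb{E}_{x_1, \ldots, x_T \sim p}\, \mathbb{E}_{\epsilon} \left[ \sup_{f \in \mathcal{F}} \sum_{t=1}^T \epsilon_t f(x_t) \right] \tag{14}$$

is the classical Rademacher complexity.

Proof. By definition, we have

$$\mathfrak{R}_T(\mathcal{F}, \mathbf{p}) = \mathbb{E}_{(\mathbf{x}, \mathbf{x}') \sim \boldsymbol{\rho}}\, \mathbb{E}_{\epsilon} \left[ \sup_{f \in \mathcal{F}} \sum_{t=1}^T \epsilon_t f(\mathbf{x}_t(\epsilon)) \right] \tag{15}$$

In the i.i.d. case, however, the tree generation according to the $\boldsymbol{\rho}$ process simplifies: for any $\epsilon \in \{\pm 1\}^T$, $t \in [T]$,

$$(\mathbf{x}_t(\epsilon), \mathbf{x}'_t(\epsilon)) \sim p \times p.$$

Thus, the $2 \cdot (2^T - 1)$ random variables $\mathbf{x}_t(\epsilon), \mathbf{x}'_t(\epsilon)$ are all i.i.d. drawn from $p$. Writing the expectation (15)
explicitly as an average over paths, we get

$$\mathfrak{R}_T(\mathcal{F}, \mathbf{p}) = \frac{1}{2^T} \sum_{\epsilon \in \{\pm 1\}^T} \mathbb{E}_{(\mathbf{x}, \mathbf{x}') \sim \boldsymbol{\rho}} \left[ \sup_{f \in \mathcal{F}} \sum_{t=1}^T \epsilon_t f(\mathbf{x}_t(\epsilon)) \right] = \frac{1}{2^T} \sum_{\epsilon \in \{\pm 1\}^T} \mathbb{E}_{x_1, \ldots, x_T \sim p} \left[ \sup_{f \in \mathcal{F}} \sum_{t=1}^T \epsilon_t f(x_t) \right] = \mathbb{E}_{\epsilon}\, \mathbb{E}_{x_1, \ldots, x_T \sim p} \left[ \sup_{f \in \mathcal{F}} \sum_{t=1}^T \epsilon_t f(x_t) \right]$$

The second equality holds because, for any fixed path $\epsilon$, the $T$ random variables $\{\mathbf{x}_t(\epsilon)\}_{t \in [T]}$ have joint
distribution $p^T$. □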

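Proposition 6. For any joint distribution $\mathbf{p}$,

$$\mathfrak{R}_T(\mathcal{F}, \mathbf{p}) \le \mathfrak{R}_T(\mathcal{F})$$

where

$$\mathfrak{R}_T(\mathcal{F}) := \sup_{\mathbf{x}} \mathbb{E}_{\epsilon} \left[ \sup_{f \in \mathcal{F}} \sum_{t=1}^T \epsilon_t f(\mathbf{x}_t(\epsilon)) \right] \tag{16}$$

is the sequential Rademacher complexity defined in [11].

Proof. To make the $\boldsymbol{\rho}$ process associated with $\mathbf{p}$ more explicit, we use the expanded definition:

$$\mathfrak{R}_T(\mathcal{F}, \mathbf{p}) = \mathbb{E}_{x_1, x'_1 \sim p_1} \mathbb{E}_{\epsilon_1} \mathbb{E}_{x_2, x'_2 \sim p_2(\cdot|\chi_1(\epsilon_1))} \mathbb{E}_{\epsilon_2} \cdots \mathbb{E}_{x_T, x'_T \sim p_T(\cdot|\chi_1(\epsilon_1), \ldots, \chi_{T-1}(\epsilon_{T-1}))} \mathbb{E}_{\epsilon_T} \left[ \sup_{f \in \mathcal{F}} \sum_{t=1}^T \epsilon_t f(x_t) \right] \tag{17}$$

$$\le \sup_{x_1, x'_1} \mathbb{E}_{\epsilon_1} \sup_{x_2, x'_2} \mathbb{E}_{\epsilon_2} \cdots \sup_{x_T, x'_T} \mathbb{E}_{\epsilon_T} \left[ \sup_{f \in \mathcal{F}} \sum_{t=1}^T \epsilon_t f(x_t) \right] = \sup_{x_1} \mathbb{E}_{\epsilon_1} \sup_{x_2} \mathbb{E}_{\epsilon_2} \cdots \sup_{x_T} \mathbb{E}_{\epsilon_T} \left[ \sup_{f \in \mathcal{F}} \sum_{t=1}^T \epsilon_t f(x_t) \right]$$

The inequality holds by replacing the expectation over $x_t, x'_t$ by a supremum over the same. We then get rid of
the $x'_t$'s since they do not appear anywhere. The resulting expression is the sequential Rademacher complexity (16). □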



An interesting case of hybrid i.i.d.-adversarial data is considered in Lemma 17, and we refer to its proof as
another example of an analysis of the distribution-dependent sequential Rademacher complexity.

We now turn to general properties of Rademacher complexity. The proof of the next proposition follows along
the lines of the analogous result in [11].

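Proposition 7. Distribution-dependent sequential Rademacher complexity satisfies the following properties.

1. If $\mathcal{F} \subseteq \mathcal{G}$, then $\mathfrak{R}_T(\mathcal{F}, \mathbf{p}) \le \mathfrak{R}_T(\mathcal{G}, \mathbf{p})$.

2. $\mathfrak{R}_T(\mathcal{F}, \mathbf{p}) = \mathfrak{R}_T(\mathrm{conv}(\mathcal{F}), \mathbf{p})$.

3. $\mathfrak{R}_T(c\mathcal{F}, \mathbf{p}) = |c|\, \mathfrak{R}_T(\mathcal{F}, \mathbf{p})$ for all $c \in \mathbb{R}$.

4. For any $h$, $\mathfrak{R}_T(\mathcal{F} + h, \mathbf{p}) = \mathfrak{R}_T(\mathcal{F}, \mathbf{p})$ where $\mathcal{F} + h = \{f + h : f \in \mathcal{F}\}$.

Next, we consider upper bounds on $\mathfrak{R}_T(\mathcal{F}, \mathbf{p})$ via covering numbers. Recall the definition of a (sequential)
cover, given in [11]. This notion captures sequential complexity of a function class on a given $\mathcal{X}$-valued tree
$\mathbf{x}$.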

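Definition 3. A set $V$ of $\mathbb{R}$-valued trees of depth $T$ is an $\alpha$-cover (with respect to $\ell_p$-norm) of $\mathcal{F} \subseteq \mathbb{R}^{\mathcal{X}}$ on
a tree $\mathbf{x}$ of depth $T$ if

$$\forall f \in \mathcal{F},\ \forall \epsilon \in \{\pm 1\}^T\ \exists \mathbf{v} \in V \ \text{s.t.}\ \left( \frac{1}{T} \sum_{t=1}^T |\mathbf{v}_t(\epsilon) - f(\mathbf{x}_t(\epsilon))|^p \right)^{1/p} \le \alpha$$

The covering number of a function class $\mathcal{F}$ on a given tree $\mathbf{x}$ is defined as

$$\mathcal{N}_p(\alpha, \mathcal{F}, \mathbf{x}) = \min\{|V| : V \text{ is an } \alpha\text{-cover w.r.t. } \ell_p\text{-norm of } \mathcal{F} \text{ on } \mathbf{x}\}.$$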
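Using the notion of the covering number, the following result holds.

Theorem 8. For any function class $\mathcal{F} \subseteq [-1, 1]^{\mathcal{X}}$,

$$\mathfrak{R}_T(\mathcal{F}, \mathbf{p}) \le \mathbb{E}_{(\mathbf{x}, \mathbf{x}') \sim \boldsymbol{\rho}} \inf_{\alpha} \left\{ 4T\alpha + 12 \int_{\alpha}^{1} \sqrt{T \log \mathcal{N}_2(\delta, \mathcal{F}, \mathbf{x})}\, d\delta \right\}.$$

The analogous result in [11] is stated for the worst-case adversary, and, hence, it is phrased in terms of the
maximal covering number $\sup_{\mathbf{x}} \mathcal{N}_2(\delta, \mathcal{F}, \mathbf{x})$. The proof, however, holds for any fixed $\mathbf{x}$, and thus immediately
implies Theorem 8. If the expectation over $(\mathbf{x}, \mathbf{x}')$ in Theorem 8 can be exchanged with the integral, we pass
to an upper bound in terms of the expected covering number $\mathbb{E}_{(\mathbf{x}, \mathbf{x}') \sim \boldsymbol{\rho}}\, \mathcal{N}_2(\delta, \mathcal{F}, \mathbf{x})$.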
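
To get a feel for the Dudley-type integral, the following Python sketch numerically evaluates a bound of the form $\inf_{\alpha} \{4T\alpha + 12 \int_{\alpha}^{1} \sqrt{T \log \mathcal{N}(\delta)}\, d\delta\}$ for a single, fixed covering-number profile. Both the exact form of the expression (taken here as an assumption, following the worst-case statement in [11]) and the profile $\log \mathcal{N}(\delta) = 5 \log(1/\delta)$ are illustrative choices rather than quantities from the paper; `dudley_bound` and `log_cover` are hypothetical names.

```python
import math


def dudley_bound(log_cover, T, grid=2000):
    """Approximate inf_alpha {4*T*alpha + 12 * int_alpha^1 sqrt(T*log N(delta)) d delta}.

    log_cover(delta) returns log N(delta). The infimum is taken over a uniform
    grid of alpha values in (0, 1] and the integral uses the trapezoid rule.
    """
    deltas = [(i + 1) / grid for i in range(grid)]            # 1/grid, ..., 1
    integrand = [math.sqrt(T * max(log_cover(d), 0.0)) for d in deltas]
    best = float("inf")
    tail = 0.0                                                # integral from alpha to 1
    for i in range(grid - 1, -1, -1):
        if i < grid - 1:
            tail += 0.5 * (integrand[i] + integrand[i + 1]) / grid
        best = min(best, 4 * T * deltas[i] + 12 * tail)
    return best


# Hypothetical parametric-type profile: log N(delta) = 5 * log(1 / delta).
bound = dudley_bound(lambda delta: 5.0 * math.log(1.0 / delta), T=1000)
```

Taking the expectation over trees in Theorem 8 would require such an evaluation for each sampled tree $\mathbf{x}$; the sketch handles one profile only.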

The following simple corollary of the above theorem shows that the distribution-dependent Rademacher
complexity of a function class $\mathcal{F}$ composed with a Lipschitz mapping can be controlled in terms of the
Dudley integral for the function class $\mathcal{F}$ itself.

Corollary 9. Fix a class $\mathcal{F} \subseteq [-1, 1]^{\mathcal{Z}}$ and a function $\phi : [-1, 1] \times \mathcal{Z} \to \mathbb{R}$. Assume, for all $z \in \mathcal{Z}$, $\phi(\cdot, z)$
is a Lipschitz function with a constant $L$. Then,




where $\phi(\mathcal{F}) = \{z \mapsto \phi(f(z), z) : f \in \mathcal{F}\}$.

The statement can be seen as a covering-number version of the Lipschitz composition lemma.



5 Constrained Adversaries 

In this section we consider adversaries who are constrained in the sequences of actions they can play. It is 
often useful to consider scenarios where the adversary is worst case, yet has some budget or constraint to 
satisfy while picking the actions. Examples of such scenarios include, for instance, games where the adversary 
is constrained to make moves that are close in some fashion to the previous move, linear games with bounded 
variance, and so on. Below we formulate such games quite generally through arbitrary constraints that the 
adversary has to satisfy on each round. 

Specifically, for a $T$-round game consider an adversary who is only allowed to play sequences $x_1, \ldots, x_T$
such that at round $t$ the constraint $C_t(x_1, \ldots, x_t) = 1$ is satisfied, where $C_t : \mathcal{X}^t \to \{0, 1\}$ represents the
constraint on the sequence played so far. The constrained adversary can be viewed as a stochastic adversary
with restrictions on the conditional distribution at time $t$ given by the set of all Borel distributions on the
set

$$\mathcal{X}_t(x_{1:t-1}) = \{x \in \mathcal{X} : C_t(x_1, \ldots, x_{t-1}, x) = 1\}.$$

Since this set includes all point distributions on each $x \in \mathcal{X}_t$, the sequential complexity simplifies in a way similar
to worst-case adversaries. We write $\mathcal{V}_T(C_{1:T})$ for the value of the game with the given constraints. Now,
assume that for any $x_{1:t-1}$, the set of all distributions on $\mathcal{X}_t(x_{1:t-1})$ is weakly compact in a way similar to
compactness of $\mathcal{P}$. That is, $\mathcal{P}_t(x_{1:t-1})$ satisfy the necessary conditions for the minimax theorem to hold. We
have the following corollaries of Theorems 1 and 3.

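Corollary 10. Let $\mathcal{F}$ and $\mathcal{X}$ be the sets of moves for the two players, satisfying the necessary conditions for
the minimax theorem to hold. Let $\{C_t : \mathcal{X}^t \to \{0, 1\}\}_{t=1}^T$ be the constraints. Then

$$\mathcal{V}_T(C_{1:T}) = \sup_{\mathbf{p}} \mathbb{E} \left[ \sum_{t=1}^T \inf_{f_t \in \mathcal{F}} \mathbb{E}_{x_t \sim p_t} [f_t(x_t)] - \inf_{f \in \mathcal{F}} \sum_{t=1}^T f(x_t) \right] \tag{18}$$

where $\mathbf{p}$ ranges over all distributions over sequences $(x_1, \ldots, x_T)$ such that $C_t(x_{1:t}) = 1$ for all $t$.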
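Corollary 11. Let the set $\mathcal{T}$ be a set of pairs $(\mathbf{x}, \mathbf{x}')$ of $\mathcal{X}$-valued trees with the property that for any
$\epsilon \in \{\pm 1\}^T$ and any $t \in [T]$

$$C_t(\chi_1(\epsilon_1), \ldots, \chi_{t-1}(\epsilon_{t-1}), \mathbf{x}_t(\epsilon)) = C_t(\chi_1(\epsilon_1), \ldots, \chi_{t-1}(\epsilon_{t-1}), \mathbf{x}'_t(\epsilon)) = 1$$

The minimax value is bounded as

$$\mathcal{V}_T(C_{1:T}) \le 2 \sup_{(\mathbf{x}, \mathbf{x}') \in \mathcal{T}} \mathfrak{R}_T(\mathcal{F}, \mathbf{p}).$$

More generally,

$$\mathcal{V}_T(C_{1:T}) \le \sup_{\mathbf{p}} \mathbb{E} \left[ \sup_{f \in \mathcal{F}} \sum_{t=1}^T f(x'_t) - f(x_t) \right] \le 2 \sup_{(\mathbf{x}, \mathbf{x}') \in \mathcal{T}} \mathbb{E}_{\epsilon} \left[ \sup_{f \in \mathcal{F}} \sum_{t=1}^T \epsilon_t \big( f(\mathbf{x}_t(\epsilon)) - M_t(f, \mathbf{x}, \mathbf{x}', \epsilon) \big) \right]$$

for any measurable function $M_t$ with the property $M_t(f, \mathbf{x}, \mathbf{x}', \epsilon) = M_t(f, \mathbf{x}', \mathbf{x}, -\epsilon)$.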



Armed with these results, we can recover and extend some known results on online learning against budgeted
adversaries. The first result says that if the adversary is not allowed to move by more than $\sigma_t$ away from
its previous average of decisions, the player has a strategy to exploit this fact and obtain lower regret. For
the $\ell_2$-norm, such "total variation" bounds have been achieved in [3] up to a $\log T$ factor. We note that in
the present formulation the budget is known to the learner, whereas the results of [3] are adaptive. Such
adaptation is beyond the scope of this paper.
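To make the budget restriction concrete, the following is a minimal sketch of the constraint function $C_t$ for the "distance from the previous average" budget, together with a check that a whole sequence is admissible. The function names, the choice of the Euclidean norm, and the convention that the first move is unconstrained are assumptions of this sketch, not part of the paper.

```python
import math


def within_average_budget(history, x, sigma_t):
    """Illustrative C_t: return 1 if x lies within sigma_t (Euclidean norm)
    of the average of the previous moves, and 0 otherwise.

    Assumption of this sketch: with no previous moves there is no average,
    so the first move is treated as unconstrained.
    """
    if not history:
        return 1
    dim = len(x)
    avg = [sum(h[i] for h in history) / len(history) for i in range(dim)]
    dist = math.sqrt(sum((x[i] - avg[i]) ** 2 for i in range(dim)))
    return 1 if dist <= sigma_t else 0


def sequence_is_feasible(xs, sigmas):
    """Check C_t(x_1, ..., x_t) = 1 for every round t of the sequence."""
    return all(
        within_average_budget(xs[:t], xs[t], sigmas[t]) == 1
        for t in range(len(xs))
    )
```

A sequence that drifts slowly passes the check, while a single large jump away from the running average violates it.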

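Proposition 12 (Variance Bound). Consider the online linear optimization setting with
$\mathcal{F} = \{f : \Psi(f) \le R^2\}$ for a $\lambda$-strongly convex function $\Psi : \mathcal{F} \to \mathbb{R}_+$ on
$\mathcal{F}$, and $\mathcal{X} = \{x : \|x\|_* \le 1\}$. Let $f(x) = \langle f, x \rangle$ for any $f \in \mathcal{F}$
and $x \in \mathcal{X}$. Consider the sequence of constraints $\{C_t\}_{t=1}^T$ given by

$$C_t(x_1, \ldots, x_{t-1}, x) = \begin{cases} 1 & \text{if } \left\|x - \frac{1}{t-1}\sum_{\tau=1}^{t-1} x_\tau\right\|_* \le \sigma_t \\ 0 & \text{otherwise} \end{cases}$$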

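Then

$$\mathcal{V}_T(C_{1:T}) \le \inf_{\alpha > 0}\left\{\frac{2R^2}{\alpha} + \frac{\alpha}{\lambda}\sum_{t=1}^T \sigma_t^2\right\} \le 2\sqrt{2}\, R \sqrt{\frac{1}{\lambda}\sum_{t=1}^T \sigma_t^2}.$$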

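In particular, we obtain the following $L_2$ variance bound. Consider the case when $\Psi : \mathcal{F} \to \mathbb{R}_+$
is given by $\Psi(f) = \frac{1}{2}\|f\|^2$, $\mathcal{F} = \{f : \|f\|_2 \le 1\}$ and $\mathcal{X} = \{x : \|x\|_2 \le 1\}$.
Consider the constrained game where the move $x_t$ played by adversary at time $t$ satisfies

$$\left\|x_t - \frac{1}{t-1}\sum_{\tau=1}^{t-1} x_\tau\right\|_2 \le \sigma_t.$$

In this case we can conclude that

$$\mathcal{V}_T(C_{1:T}) \le 2\sqrt{2}\sqrt{\sum_{t=1}^T \sigma_t^2}.$$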



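We can also derive a variance bound over the simplex. Let $\Psi(f) = \sum_{i=1}^d f_i \log(d f_i)$ be defined over the
$d$-simplex $\mathcal{F}$, and $\mathcal{X} = \{x : \|x\|_\infty \le 1\}$. Consider the constrained game where the move
$x_t$ played by adversary at time $t$ satisfies

$$\max_{j \in [d]} \left|x_t[j] - \frac{1}{t-1}\sum_{\tau=1}^{t-1} x_\tau[j]\right| \le \sigma_t.$$

For any $f \in \mathcal{F}$, $\Psi(f) \le \log(d)$ and so we conclude that

$$\mathcal{V}_T(C_{1:T}) \le 2\sqrt{2}\sqrt{\log(d)\sum_{t=1}^T \sigma_t^2}.$$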



The next Proposition gives a bound whenever the adversary is constrained to choose his decision from a
small ball around the previous decision.

Proposition 13 (Slowly-Changing Decisions). Consider the online linear optimization setting where adversary's
move at any time is close to the move during the previous time step. Let $\mathcal{F} = \{f : \Psi(f) \le R^2\}$ where
$\Psi : \mathcal{F} \to \mathbb{R}_+$ is a $\lambda$-strongly convex function on $\mathcal{F}$ and
$\mathcal{X} = \{x : \|x\|_* \le B\}$. Let $f(x) = \langle f, x \rangle$ for any $f \in \mathcal{F}$ and
$x \in \mathcal{X}$. Consider the sequence of constraints $\{C_t\}_{t=1}^T$ given by

$$C_t(x_1, \ldots, x_{t-1}, x) = \begin{cases} 1 & \text{if } \|x - x_{t-1}\|_* \le \delta \\ 0 & \text{otherwise} \end{cases}$$

Then,

$$\mathcal{V}_T(C_{1:T}) \le 2R\delta\sqrt{2T/\lambda}.$$

In particular, consider the case of a Euclidean-norm restriction on the moves. Let $\Psi : \mathcal{F} \to \mathbb{R}_+$
be given by $\Psi(f) = \frac{1}{2}\|f\|^2$, $\mathcal{F} = \{f : \|f\|_2 \le 1\}$ and $\mathcal{X} = \{x : \|x\|_2 \le 1\}$.
Consider the constrained game where the move $x_t$ played by adversary at time $t$ satisfies
$\|x_t - x_{t-1}\|_2 \le \delta$. In this case we can conclude that

$$\mathcal{V}_T(C_{1:T}) \le 2\delta\sqrt{2T}.$$

For the case of decision-making on the simplex, we obtain the following result. Let
$\Psi(f) = \sum_{i=1}^d f_i \log(d f_i)$ be defined over the $d$-simplex $\mathcal{F}$, and
$\mathcal{X} = \{x : \|x\|_\infty \le 1\}$. Consider the constrained game where the move $x_t$ played by adversary at
time $t$ satisfies $\|x_t - x_{t-1}\|_\infty \le \delta$. In this case note that for any $f \in \mathcal{F}$,
$\Psi(f) \le \log(d)$ and so we can conclude that

$$\mathcal{V}_T(C_{1:T}) \le 2\delta\sqrt{2T\log(d)}.$$

6 The I.I.D. Adversary

In this section, we consider an adversary who is restricted to draw the moves from a fixed distribution $p$
throughout the game. That is, the time-invariant restrictions are $\mathcal{P}_t(x_{1:t-1}) = \{p\}$. A reader will
notice that the definition of the value in (5) forces the restrictions $\mathcal{P}_{1:T}$ to be known to the player
before the game. This, in turn, means that the distribution $p$ is known to the learner. In some sense, the problem
becomes not interesting, as there is no learning to be done. This is indeed an artifact of the minimax formulation in
the extensive form. To circumvent the problem, we are forced to define a new value of the game in terms
of strategies. Such a formulation does allow us to "hide" the distribution from the player since we can talk
about "mappings" instead of making the information explicit. We then show two novel results. First, the
regret-minimization game with i.i.d. data when the player does not observe the distribution $p$ is equivalent
(in terms of learnability) to the classical batch learning problem. Second, for supervised learning, when it
comes to minimizing regret, the knowledge of $p$ does not help the learner for some distributions.

Let us first define some relevant quantities. Similarly to (6), let $s = \{s_t\}_{t=1}^T$ be a $T$-round strategy for
the player, with $s_t : (\mathcal{F} \times \mathcal{X})^{t-1} \to \mathcal{Q}$. The game where the player does not
observe the i.i.d. distribution of the adversary will be called a distribution-blind i.i.d. game, and its minimax
value will be called the distribution-blind minimax value:



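$$\mathcal{V}_T^{\text{blind}} = \inf_{s} \sup_{p} \left\{ \mathbb{E}_{x_1, \ldots, x_T \sim p}\, \mathbb{E}_{f_1 \sim s_1} \cdots \mathbb{E}_{f_T \sim s_T(x_{1:T-1}, f_{1:T-1})} \left[\sum_{t=1}^T f_t(x_t) - \inf_{f \in \mathcal{F}} \sum_{t=1}^T f(x_t)\right] \right\}.$$

Furthermore, define the analogue of the value (2) for a general (not necessarily supervised) setting:

$$\mathcal{V}_T^{\text{batch}} = \inf_{\hat{f}_T} \sup_{p \in \mathcal{P}} \left\{ \mathbb{E} \hat{f}_T - \inf_{f \in \mathcal{F}} \mathbb{E} f \right\}.$$

For a distribution $p$, the value (5) of the online i.i.d. game, as defined through the restrictions
$\mathcal{P}_t = \{p\}$ for all $t$, will be written as $\mathcal{V}_T(\{p\})$. For the non-blind game, we say that the
problem is online learnable in the i.i.d. setting if

$$\sup_{p} \frac{1}{T}\mathcal{V}_T(\{p\}) \to 0.$$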



We now proceed to study relationships between online and batch learnability. 



6.1 Equivalence of Online Learnability and Batch Learnability 



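Theorem 14. For a given function class $\mathcal{F}$, online learnability in the distribution-blind game is
equivalent to batch learnability. That is,

$$\frac{1}{T}\mathcal{V}_T^{\text{blind}} \to 0 \quad \text{if and only if} \quad \mathcal{V}_T^{\text{batch}} \to 0.$$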

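Proof of Theorem 14. With a proof along the lines of Proposition 2 we establish that

$$\frac{1}{T}\mathcal{V}_T^{\text{blind}} = \inf_{s} \sup_{p} \left\{ \frac{1}{T}\mathbb{E}_{x_1, \ldots, x_T \sim p}\left[\sum_{t=1}^T \mathbb{E}_{f_t \sim s_t(x_{1:t-1}, f_{1:t-1})}[f_t(x_t)]\right] - \mathbb{E}_{x_1, \ldots, x_T \sim p}\left[\inf_{f \in \mathcal{F}} \frac{1}{T}\sum_{t=1}^T f(x_t)\right] \right\}$$

$$\ge \inf_{s} \sup_{p} \left\{ \mathbb{E}_{x_1, \ldots, x_T \sim p}\left[\frac{1}{T}\sum_{t=1}^T \mathbb{E}_{f_t \sim s_t(x_1, \ldots, x_{t-1})}\left[\mathbb{E}_{x \sim p}[f_t(x)]\right]\right] - \inf_{f \in \mathcal{F}} \mathbb{E}_{x_1, \ldots, x_T \sim p}\left[\frac{1}{T}\sum_{t=1}^T f(x_t)\right] \right\}$$

where in the second line we passed to strategies that do not depend on their own randomizations. The
argument for this can be found in the proof of Proposition 2. The last expression can be conveniently
written as

$$\frac{1}{T}\mathcal{V}_T^{\text{blind}} \ge \inf_{s} \sup_{p} \left\{ \mathbb{E}_{x_1, \ldots, x_T \sim p}\, \mathbb{E}_{r \sim \text{Unif}[T]}\, \mathbb{E}_{f_r \sim s_r(x_1, \ldots, x_{r-1})}\left[\mathbb{E}_{x \sim p}[f_r(x)]\right] - \inf_{f \in \mathcal{F}} \mathbb{E}_{x \sim p}[f(x)] \right\}.$$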



The above implies that if $\mathcal{V}_T^{\text{blind}} = o(T)$ (i.e. the problem is learnable against an i.i.d.
adversary in the online sense without knowing the distribution $p$), then the problem is learnable in the classical
batch sense. Specifically, there exists a strategy $s = \{s_t\}_{t=1}^T$ with $s_t : \mathcal{X}^{t-1} \to \mathcal{Q}$
such that

$$\sup_{p} \left\{ \mathbb{E}_{x_1, \ldots, x_T \sim p}\, \mathbb{E}_{r \sim \text{Unif}[T]}\left[\mathbb{E}_{f \sim s_r(x_1, \ldots, x_{r-1})}\left[\mathbb{E}_{x \sim p}[f(x)]\right]\right] - \inf_{f \in \mathcal{F}} \mathbb{E}_{x \sim p}[f(x)] \right\} = o(1).$$

This strategy can be used to define a consistent (randomized) algorithm $\hat{f}_T : \mathcal{X}^T \to \mathcal{F}$ as
follows. Given an i.i.d. sample $x_1, \ldots, x_T$, draw a random index $r$ from $1, \ldots, T$, and define
$\hat{f}_T$ as a random draw from distribution $s_r(x_1, \ldots, x_{r-1})$. We have proven that
$\mathcal{V}_T^{\text{batch}} \to 0$ as $T$ increases, which is the requirement of Eq. (2) in the general
non-supervised case. Note that the rate of this convergence is upper bounded by the rate of decay of
$\frac{1}{T}\mathcal{V}_T^{\text{blind}}$ to zero.

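To show the reverse direction, say a problem is learnable in the classical batch sense. That is,
$\mathcal{V}_T^{\text{batch}} \to 0$. Hence, there exists a randomized strategy $s = (s_1, s_2, \ldots)$ such that
$s_t : \mathcal{X}^{t-1} \to \mathcal{Q}$ and

$$\sup_{p} \left\{ \mathbb{E}_{x_1, \ldots, x_{t-1} \sim p}\left[\mathbb{E}_{f_t \sim s_t(x_1, \ldots, x_{t-1})}\, \mathbb{E}_{x \sim p}[f_t(x)]\right] - \inf_{f \in \mathcal{F}} \mathbb{E}_{x \sim p}[f(x)] \right\} = o(1)$$

as $t \to \infty$. Hence we have that

$$\sup_{p} \left\{ \mathbb{E}_{x_1, \ldots, x_T \sim p}\left[\frac{1}{T}\sum_{t=1}^T \mathbb{E}_{f_t \sim s_t(x_1, \ldots, x_{t-1})}\, \mathbb{E}_{x \sim p}[f_t(x)]\right] - \inf_{f \in \mathcal{F}} \mathbb{E}_{x \sim p}[f(x)] \right\}$$

$$\le \frac{1}{T}\sum_{t=1}^T \sup_{p} \left\{ \mathbb{E}_{x_1, \ldots, x_{t-1} \sim p}\left[\mathbb{E}_{f_t \sim s_t(x_1, \ldots, x_{t-1})}\, \mathbb{E}_{x \sim p}[f_t(x)]\right] - \inf_{f \in \mathcal{F}} \mathbb{E}_{x \sim p}[f(x)] \right\} = o(1)$$

because a Cesàro average of a convergent sequence also converges to the same limit.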
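For completeness, the Cesàro step can be spelled out as follows. Here $a_t$ is shorthand, introduced only for this sketch, for the $t$-th supremum in the sum above, so that $a_t \to 0$.

```latex
% Sketch: if a_t -> 0 then (1/T) \sum_{t \le T} a_t -> 0.
% Fix \eta > 0 and choose t_0 such that |a_t| \le \eta for all t > t_0. Then
\left|\frac{1}{T}\sum_{t=1}^{T} a_t\right|
  \le \frac{1}{T}\sum_{t=1}^{t_0} |a_t| + \frac{1}{T}\sum_{t=t_0+1}^{T} |a_t|
  \le \frac{t_0 \max_{t \le t_0} |a_t|}{T} + \eta ,
% and the first term is at most \eta once T is large enough, so the average
% is eventually bounded by 2\eta. Since \eta was arbitrary, the average tends to 0.
```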
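As shown in [13], the problem is learnable in the batch sense if and only if

$$\mathbb{E}_{x_1, \ldots, x_T \sim p}\left[\inf_{f \in \mathcal{F}} \mathbb{E}_{x \sim p}[f(x)] - \inf_{f \in \mathcal{F}} \frac{1}{T}\sum_{t=1}^T f(x_t)\right] \to 0$$

and this rate is uniform for all distributions. Hence we have that

$$\sup_{p} \left\{ \mathbb{E}_{x_1, \ldots, x_T \sim p}\left[\frac{1}{T}\sum_{t=1}^T \mathbb{E}_{f_t \sim s_t(x_1, \ldots, x_{t-1})}\, \mathbb{E}_{x \sim p}[f_t(x)] - \inf_{f \in \mathcal{F}} \frac{1}{T}\sum_{t=1}^T f(x_t)\right] \right\} = o(1).$$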



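We conclude that if the problem is learnable in the i.i.d. batch sense then

$$o(T) = \sup_{p} \mathbb{E}_{x_1, \ldots, x_T \sim p}\left[\sum_{t=1}^T \mathbb{E}_{f_t \sim s_t(x_1, \ldots, x_{t-1})}\, \mathbb{E}_{x \sim p}[f_t(x)] - \inf_{f \in \mathcal{F}} \sum_{t=1}^T f(x_t)\right]$$

$$= \sup_{p} \mathbb{E}_{x_1, \ldots, x_T \sim p}\left[\sum_{t=1}^T \mathbb{E}_{f_t \sim s_t(x_1, \ldots, x_{t-1})}[f_t(x_t)] - \inf_{f \in \mathcal{F}} \sum_{t=1}^T f(x_t)\right]$$

$$= \sup_{p} \mathbb{E}_{x_1, \ldots, x_T \sim p}\, \mathbb{E}_{f_1 \sim s_1} \cdots \mathbb{E}_{f_T \sim s_T(x_1, \ldots, x_{T-1})}\left\{\sum_{t=1}^T f_t(x_t) - \inf_{f \in \mathcal{F}} \sum_{t=1}^T f(x_t)\right\} \ge \mathcal{V}_T^{\text{blind}}. \qquad (19)$$

Thus we have shown that if a problem is learnable in the batch sense then it is learnable versus all i.i.d.
adversaries in the online sense, provided that the distribution is not known to the player.

□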

At this point, the reader might wonder if the game formulation studied in the rest of the paper, with the 
restrictions known to the player, is any easier than batch and distribution-blind learning. In the next section, 
we show that this is not the case for supervised learning. 
6.2 Distribution-Blind vs Non-Blind Supervised Learning



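In the supervised game, at time $t$, the player picks a function $f_t \in [-1, 1]^{\mathcal{X}}$, the adversary
provides input-target pair $(x_t, y_t)$, and the player suffers loss $|f_t(x_t) - y_t|$. The value of the online
supervised learning game for general restrictions $\mathcal{P}_{1:T}$ is defined as

$$\mathcal{V}_T^{\text{sup}}(\mathcal{P}_{1:T}) = \inf_{q_1} \sup_{p_1} \mathbb{E} \cdots \inf_{q_T} \sup_{p_T} \mathbb{E}\left[\sum_{t=1}^T |f_t(x_t) - y_t| - \inf_{f \in \mathcal{F}} \sum_{t=1}^T |f(x_t) - y_t|\right]$$

where $(x_t, y_t)$ has distribution $p_t$. As before, the value of an i.i.d. supervised game with a distribution
$p_{X \times Y}$ will be written as $\mathcal{V}_T^{\text{sup}}(\{p_{X \times Y}\})$.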

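Similarly to Eq. (2), define the batch supervised value for the absolute loss as

$$\mathcal{V}_T^{\text{batch, sup}} = \inf_{\hat{f}} \sup_{p_{X \times Y}} \left\{ \mathbb{E}|y - \hat{f}(x)| - \inf_{f \in \mathcal{F}} \mathbb{E}|y - f(x)| \right\} \qquad (20)$$

and the distribution-blind supervised value as

$$\mathcal{V}_T^{\text{blind, sup}} = \inf_{s} \sup_{p} \left\{ \mathbb{E}_{z_1, \ldots, z_T \sim p}\, \mathbb{E}_{f_1 \sim s_1} \cdots \mathbb{E}_{f_T \sim s_T}\left[\sum_{t=1}^T |f_t(x_t) - y_t| - \inf_{f \in \mathcal{F}} \sum_{t=1}^T |f(x_t) - y_t|\right] \right\}$$

where we use the shorthand $z_t = (x_t, y_t)$ for each $t$.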
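Lemma 15. In the supervised case,

$$\frac{1}{4} T\, \mathcal{V}_T^{\text{batch, sup}} \le \sup_{p_X} \mathfrak{R}_T(\mathcal{F}, p_X) \le \sup_{p_X} \mathcal{V}_T^{\text{sup}}(\{p_X \times U_Y\}) \le \sup_{p_{X \times Y}} \mathcal{V}_T^{\text{sup}}(\{p_{X \times Y}\}) \le \mathcal{V}_T^{\text{blind, sup}}$$

where $\mathfrak{R}_T(\mathcal{F}, p_X)$ is the classical Rademacher complexity defined in (14), and $U_Y$ is the
Rademacher distribution.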
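As an illustration of the classical Rademacher complexity that appears in the lemma, the sketch below estimates $\mathbb{E}_\epsilon \sup_\theta \sum_t \epsilon_t \mathbf{1}\{x_t < \theta\}$ by Monte Carlo for the class of thresholds on a fixed sample. It assumes the unnormalized convention (no $1/T$ factor) and distinct sample points; the function name and the choice of class are assumptions made for illustration only.

```python
import random


def empirical_rademacher_thresholds(xs, num_draws, rng):
    """Monte Carlo estimate of E_eps sup_theta sum_t eps_t * 1{x_t < theta}.

    For distinct points, the indicator vectors realized by thresholds are
    exactly the prefixes of the sample in sorted order (including the empty
    prefix), so the supremum is the largest prefix sum of the signs.
    """
    order = sorted(range(len(xs)), key=lambda i: xs[i])
    total = 0.0
    for _ in range(num_draws):
        eps = [rng.choice((-1, 1)) for _ in xs]
        best = 0        # empty prefix: theta below every point
        prefix = 0
        for i in order:
            prefix += eps[i]
            best = max(best, prefix)
        total += best
    return total / num_draws
```

The estimate always lies between 0 and the sample size; for a single point it concentrates around 1/2, the mean of $\max(0, \epsilon)$.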
Theorem 14, specialized to the supervised setting, says that $\frac{1}{T}\mathcal{V}_T^{\text{blind, sup}} \to 0$ if and
only if $\mathcal{V}_T^{\text{batch, sup}} \to 0$. Since $\sup_{p_{X \times Y}} \mathcal{V}_T^{\text{sup}}(\{p_{X \times Y}\})$
is sandwiched between these two values, we conclude the following.

Corollary 16. Either the supervised problem is learnable in the batch sense (and, by Theorem 14, in the
distribution-blind online sense), in which case
$\sup_{p_{X \times Y}} \mathcal{V}_T^{\text{sup}}(\{p_{X \times Y}\}) = o(T)$. Or, the problem is not learnable in the
batch (and the distribution-blind sense), in which case it is not learnable for all distributions in the online
sense: $\sup_{p_{X \times Y}} \mathcal{V}_T^{\text{sup}}(\{p_{X \times Y}\})$ does not grow sublinearly.



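Proof of Lemma 15. The first statement follows from the well-known classical symmetrization argument:

$$\mathcal{V}_T^{\text{batch, sup}} = \inf_{\hat{f}} \sup_{p_{X \times Y}} \left\{ \mathbb{E}|y - \hat{f}(x)| - \inf_{f \in \mathcal{F}} \mathbb{E}|y - f(x)| \right\}$$

$$\le \sup_{p_{X \times Y}} \left\{ \mathbb{E}|y - \tilde{f}(x)| - \inf_{f \in \mathcal{F}} \mathbb{E}|y - f(x)| \right\}$$

$$\le 2 \sup_{p_{X \times Y}} \mathbb{E} \sup_{f \in \mathcal{F}} \left| \frac{1}{T}\sum_{t=1}^T |y_t - f(x_t)| - \mathbb{E}|y - f(x)| \right|$$

$$\le 4 \sup_{p_X} \mathbb{E}_{x_{1:T}} \mathbb{E}_{\epsilon_{1:T}} \sup_{f \in \mathcal{F}} \frac{1}{T}\sum_{t=1}^T \epsilon_t f(x_t)$$

where the first inequality is obtained by choosing the empirical minimizer $\tilde{f}$ as an estimator.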
The second inequality of the Lemma follows from the lower bound proved in Section 7.1. Lemma 20 implies
that the game with i.i.d. restrictions $\mathcal{P}_t = \{p_X \times U_Y\}$ for all $t$ satisfies

$$\mathcal{V}_T^{\text{sup}}(\{p_X \times U_Y\}) \ge \mathfrak{R}_T(\mathcal{F}, p_X)$$

for any $p_X$.

Now, clearly, the distribution-blind supervised game is harder than the game with the knowledge of the
distribution. That is,

$$\sup_{p_{X \times Y}} \mathcal{V}_T^{\text{sup}}(\{p_{X \times Y}\}) \le \mathcal{V}_T^{\text{blind, sup}}.$$

□



7 Supervised Learning 

In Section 6 we studied the relationship between batch and online learnability in the i.i.d. setting, focusing
on the supervised case in Section 6.2. We now provide a more in-depth study of the value of the supervised
game beyond the i.i.d. setting.

As shown in [Tl] [T^, the value of the supervised game with the worst-case adversary is upper and lower 
bounded (to within $O(\log^{3/2} T)$) by sequential Rademacher complexity. This complexity can be linear in
$T$ if the function class has infinite Littlestone's dimension, rendering worst-case learning futile. This is the
case with a class of threshold functions on an interval, which has a Vapnik-Chervonenkis dimension of 1.
Surprisingly, it was shown in [7] that for the classification problem with i.i.d. $x$'s and adversarial labels
$y$, online regret can be bounded whenever VC dimension of the class is finite. This suggests that it is the
manner in which $x$ is chosen that plays the decisive role in supervised learning. We indeed show that this
is the case. Irrespective of the way the labels are chosen, if $x_t$ are chosen i.i.d. then regret is (to within a
constant) given by the classical Rademacher complexity. If $x_t$'s are chosen adversarially, it is (to within a
logarithmic factor) given by the sequential Rademacher complexity.

We remark that the algorithm of [7] is "distribution-blind" in the sense of last section. The results we 
present below are for non-blind games. While the equivalence of blind and non-blind learning was shown 
in the previous section for the i.i.d. supervised case, we hypothesize that it holds for the hybrid supervised 
learning scenario as well. 

Let the loss class be $\phi(\mathcal{F}) = \{(x, y) \mapsto \phi(f(x), y) : f \in \mathcal{F}\}$ for some Lipschitz
function $\phi : \mathbb{R} \times \mathcal{Y} \to \mathbb{R}$ (i.e. $\phi(f(x), y) = |f(x) - y|$). Let
$\mathcal{P}_{1:T}$ be the restrictions on the adversary. Theorem 3 then states that

$$\mathcal{V}_T^{\text{sup}}(\mathcal{P}_{1:T}) \le 2 \sup_{\mathbf{p}} \mathfrak{R}_T(\phi(\mathcal{F}), \mathbf{p})$$

where the supremum is over all joint distributions $\mathbf{p}$ on the sequences $((x_1, y_1), \ldots, (x_T, y_T))$, such
that $\mathbf{p}$ satisfies the restrictions $\mathcal{P}_{1:T}$. The idea is to pass from a complexity of
$\phi(\mathcal{F})$ to that of the class $\mathcal{F}$ via a Lipschitz composition lemma, and then note that the
resulting complexity does not depend on $y$-variables. If this can be done, the complexity associated only with the
choice of $x$ is then an upper bound on the value of the game. The results of this section, therefore, hold whenever
a Lipschitz composition lemma can be proved for the distribution-dependent Rademacher complexity.

The following lemma gives an upper bound on the distribution-dependent Rademacher complexity in the
"hybrid" scenario, i.e. the distribution of $x_t$'s is i.i.d. from a fixed distribution $p$ but the distribution of
$y_t$'s is arbitrary (recall that adversarial choice of the player translates into vacuous restrictions
$\mathcal{P}_t$ on the mixed strategies). Interestingly, the upper bound is a blend of the classical Rademacher
complexity (on the $x$-variable) and the worst-case sequential Rademacher complexity for the $y$-variable. This
captures the hybrid nature of the problem.
Lemma 17. Fix a class $\mathcal{F} \subseteq \mathbb{R}^{\mathcal{X}}$ and a function
$\phi : \mathbb{R} \times \mathcal{Y} \to \mathbb{R}$. Given a distribution $p$ over $\mathcal{X}$, let $\mathfrak{P}$
consist of all joint distributions $\mathbf{p}$ such that the conditional distribution
$p_t^{x,y}(x_t, y_t \mid x^{t-1}, y^{t-1}) = p(x_t) \times p_t(y_t \mid x^{t-1}, y^{t-1}, x_t)$ for some conditional
distribution $p_t$. Then,

$$\sup_{\mathbf{p} \in \mathfrak{P}} \mathfrak{R}_T(\phi(\mathcal{F}), \mathbf{p}) \le \mathbb{E}_{x_1, \ldots, x_T \sim p} \sup_{\mathbf{y}} \mathbb{E}_{\epsilon}\left[\sup_{f \in \mathcal{F}} \sum_{t=1}^T \epsilon_t \phi(f(x_t), \mathbf{y}_t(\epsilon))\right].$$

Armed with this result, we can appeal to the following Lipschitz composition lemma. It says that the
distribution-dependent sequential Rademacher complexity for the hybrid scenario with a Lipschitz loss can
be upper bounded via the classical Rademacher complexity of the function class on the $x$-variable only. That
is, we can "erase" the Lipschitz loss function together with the (adversarially chosen) $y$ variable. The lemma
is an analogue of the classical contraction principle initially proved by Ledoux and Talagrand [8] for the i.i.d.
process.

Lemma 18. Fix a class $\mathcal{F} \subseteq [-1, 1]^{\mathcal{X}}$ and a function
$\phi : [-1, 1] \times \mathcal{Y} \to \mathbb{R}$. Assume, for all $y \in \mathcal{Y}$, $\phi(\cdot, y)$ is
a Lipschitz function with a constant $L$. Let $\mathfrak{P}$ be as in Lemma 17. Then, for any
$\mathbf{p} \in \mathfrak{P}$,

$$\mathfrak{R}_T(\phi(\mathcal{F}), \mathbf{p}) \le L\, \mathfrak{R}_T(\mathcal{F}, p).$$



Lemma 17 in tandem with Lemma 18 implies that the value of the game with i.i.d. $x$'s and adversarial $y$'s is
upper bounded by the classical Rademacher complexity.

For the case of adversarially-chosen $x$'s and (potentially) adversarially chosen $y$'s, the necessary Lipschitz
composition lemma is proved in [11] with an extra factor of $O(\log^{3/2} T)$. We summarize the results in the
following Corollary.

Corollary 19. The following results hold for stochastic-adversarial supervised learning with absolute loss.

• If $x_t$ are chosen adversarially, then irrespective of the way $y_t$'s are chosen,

$$\mathcal{V}_T^{\text{sup}} \le 2\mathfrak{R}_T(\mathcal{F}) \times O(\log^{3/2}(T)),$$

where $\mathfrak{R}(\mathcal{F})$ is the (worst-case) sequential Rademacher complexity [11]. A matching lower bound of
$\mathfrak{R}(\mathcal{F})$ is attained by choosing $y_t$'s as i.i.d. Rademacher random variables.

• If $x_t$ are chosen i.i.d. from $p$, then irrespective of the way $y_t$'s are chosen,

$$\mathcal{V}_T^{\text{sup}} \le 2\mathfrak{R}_T(\mathcal{F}, p),$$

where $\mathfrak{R}(\mathcal{F}, p)$ defined in (14) is the classical Rademacher complexity. The matching lower bound of
$\mathfrak{R}(\mathcal{F}, p)$ is obtained by choosing $y_t$'s as i.i.d. Rademacher random variables.

The lower bounds stated in Corollary 19 are proved in the next section.



7.1 Lower Bounds 



We now give two lower bounds on the value $\mathcal{V}_T^{\text{sup}}$, defined with the absolute value loss function
$\phi(f(x), y) = |f(x) - y|$. The lower bounds hold whenever the adversary's restrictions $\{\mathcal{P}_t\}_{t=1}^T$
allow the labels to be i.i.d. coin flips. That is, for the purposes of proving the lower bound, it is enough to
choose a joint probability $\mathbf{p}$ (an oblivious strategy for the adversary) such that each conditional
probability distribution on the pair $(x, y)$ is of the form $p_t(x \mid x_1, \ldots, x_{t-1}) \times b(y)$ with
$b(-1) = b(1) = 1/2$. Pick any such $\mathbf{p}$.
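The reason coin-flip labels lead to Rademacher averages can be seen from a one-line identity for the absolute loss. The following is a sketch of that step, assuming (as in Section 6.2) that all functions take values in $[-1, 1]$ and that the labels are $\pm 1$.

```latex
% For f(x) \in [-1, 1] and y \in \{-1, +1\}:
|f(x) - y| = 1 - y\, f(x).
% Hence, if y_t is a fair coin flip independent of the player's choice f_t,
\mathbb{E}_{y_t} |f_t(x_t) - y_t| = 1 - \mathbb{E}[y_t]\, f_t(x_t) = 1 ,
% so every player strategy has expected cumulative loss T, while
\inf_{f \in \mathcal{F}} \sum_{t=1}^{T} |f(x_t) - y_t|
   = T - \sup_{f \in \mathcal{F}} \sum_{t=1}^{T} y_t f(x_t).
% The expected regret therefore equals
\mathbb{E} \sup_{f \in \mathcal{F}} \sum_{t=1}^{T} y_t f(x_t),
% a Rademacher average with the labels y_t playing the role of the signs.
```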

Our first lower bound will hold whenever the restrictions $\mathcal{P}_t$ are history-independent. That is,
$\mathcal{P}_t(x_{1:t-1}) = \mathcal{P}_t(x'_{1:t-1})$ for any $x_{1:t-1}, x'_{1:t-1} \in \mathcal{X}^{t-1}$. Since the
worst-case (all distributions) and i.i.d. (single distribution) are both history-independent restrictions, the lemma
can be used to provide lower bounds for these cases. The second lower bound holds more generally, yet it is weaker
than that of Lemma 20.
Lemma 20. Let $\mathfrak{P}$ be the set of all $\mathbf{p}$ satisfying the history-independent restrictions
$\{\mathcal{P}_t\}$ and $\mathfrak{P}' \subseteq \mathfrak{P}$ the subset that allows the label $y_t$ to be an i.i.d.
Rademacher random variable for each $t$. Then

$$\mathcal{V}_T^{\text{sup}}(\mathcal{P}_{1:T}) \ge \sup_{\mathbf{p} \in \mathfrak{P}'} \mathfrak{R}_T(\mathcal{F}, \mathbf{p}).$$

In particular, Lemma 20 gives matching lower bounds for Corollary 19.

Lemma 21. Let $\mathfrak{P}$ be the set of all $\mathbf{p}$ satisfying the restrictions $\{\mathcal{P}_t\}$ and let
$\mathfrak{P}' \subseteq \mathfrak{P}$ be the subset that allows the label $y_t$ to be an i.i.d. Rademacher random
variable for each $t$. Then

$$\mathcal{V}_T^{\text{sup}}(\mathcal{P}_{1:T}) \ge \sup_{\mathbf{p} \in \mathfrak{P}'} \mathbb{E}_{(\mathbf{x}, \mathbf{x}') \sim \rho}\, \mathbb{E}_{\epsilon}\left[\sup_{f \in \mathcal{F}} \sum_{t=1}^T \epsilon_t f(\mathbf{x}_t(-\mathbf{1}))\right].$$

Remark 22. The supervised learning protocol is sometimes defined as follows. At each round $t$, the pair
$(x_t, y_t)$ is chosen by the adversary, yet the player first observes only the "side information" $x_t$. The player
then makes a prediction $\hat{y}_t$ and, subsequently, the label $y_t$ is revealed. The goal is to minimize regret
defined as

$$\sum_{t=1}^T |\hat{y}_t - y_t| - \inf_{f \in \mathcal{F}} \sum_{t=1}^T |f(x_t) - y_t|.$$

As briefly mentioned in [11], this protocol is equivalent to a slightly modified version of the game we consider.
Indeed, suppose at each step we are allowed to output any function $f' : \mathcal{X} \to \mathcal{Y}$ (not just from
$\mathcal{F}$), yet regret is still defined as a comparison to the best $f \in \mathcal{F}$. This modified version is
clearly equivalent to first observing $x_t$ and then predicting $\hat{y}_t$. Denote by $\tilde{\mathcal{V}}_T$ the value
of the modified "improper learning" game, where the player is allowed to choose any
$f_t \in \mathcal{Y}^{\mathcal{X}}$. Side-stepping the issue of putting distributions on the space of all functions
$\mathcal{Y}^{\mathcal{X}}$, it is easy to check that Theorem 7 goes through with only one modification: the infima in
the cumulative cost are over all measurable functions $f_t \in \mathcal{Y}^{\mathcal{X}}$. The key observation is that
these $f_t$'s are replaced by $f \in \mathcal{F}$ in the proof of Theorem 3. Hence, the upper bound on
$\tilde{\mathcal{V}}_T$ is the same as the one on the "proper learning" game where our predictions have to lie inside
$\mathcal{F}$.



8 Smoothed Analysis 

The development of smoothed analysis over the past decade is arguably one of the hallmarks in the study of
complexity of algorithms. In contrast to the overly optimistic average complexity and the overly pessimistic
worst-case complexity, smoothed complexity can be seen as a more realistic measure of an algorithm's performance.
In their groundbreaking work, Spielman and Teng [14] showed that the smoothed running time
complexity of the simplex method is polynomial. This result explains good performance of the method in
practice despite its exponential-time worst-case complexity.

In this section, we consider the effect of smoothing on learnability. Analogously to complexity analysis of 
algorithms, learning theory has been concerned with i.i.d. (that is, average case) learnability and with online 
(that is, worst-case) learnability. In the former, the learner is presented with a batch of i.i.d. data, while 
in the latter the learner is presented with a sequence adaptively chosen by the malicious opponent. It can 
be argued that neither the average nor the worst-case setting reasonably models real-world situations. A 
natural step is to consider smoothed learning, defined as a random perturbation of the worst-case sequence. 

It is well-known that there is a gap between the i.i.d. and the worst-case scenarios. In fact, we do not 
need to go far for an example: A simple class of threshold functions on a unit interval is learnable in the 
i.i.d. supervised learning scenario, yet difficult in the online worst-case model [Sid]. When it comes to i.i.d. 
supervised learning, the relevant complexity of a class is captured by the Vapnik-Chervonenkis dimension, 
and the analogous notion for worst-case learning is the Littlestone's dimension OS EI]. For the simple
example of threshold functions, the VC dimension is one, yet the Littlestone's dimension is infinite. The
proof of the latter fact, however, reveals that the infinite number of mistakes on the part of the player is
due to the infinite resolution of the carefully chosen adversarial sequence. We can argue that this infinite
precision is an unreasonable assumption on the power of a real-world opponent. It is then natural to ask:
What happens if the adversary adaptively chooses the worst-case sequence, yet the moves are smoothed by
exogenous noise? The scope of what is learnable is greatly enlarged if smoothed analysis makes problems
with infinite Littlestone's dimension tractable.
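The "infinite resolution" of the adversarial sequence can be made concrete with a standard bisection construction, sketched below under assumptions of our own: the learner is deterministic, labels follow the convention $y = \mathbf{1}\{z < \theta\}$, and exact rational arithmetic stands in for unbounded precision. The names `force_mistakes` and `nearest_label_learner` are hypothetical.

```python
from fractions import Fraction


def force_mistakes(learner, rounds):
    """Bisection adversary for thresholds on [0, 1].

    Each round it queries the midpoint of the interval (lo, hi] of thresholds
    still consistent with the past and reveals the opposite of the learner's
    guess. Labels follow y = 1{z < theta}, so y = 1 means theta > z.
    """
    lo, hi = Fraction(0), Fraction(1)
    history = []
    mistakes = 0
    for _ in range(rounds):
        z = (lo + hi) / 2                 # needs one more bit of precision each round
        guess = learner(history, z)
        y = 1 - guess                     # the revealed label contradicts the guess
        mistakes += 1
        if y == 1:
            lo = z                        # consistent thresholds now satisfy theta > z
        else:
            hi = z                        # consistent thresholds now satisfy theta <= z
        history.append((z, y))
    return mistakes, history, hi          # hi is consistent with every revealed label


def nearest_label_learner(history, z):
    """Hypothetical deterministic learner: copy the label of the nearest past point."""
    if not history:
        return 1
    _, label = min(history, key=lambda pair: abs(pair[0] - z))
    return label
```

The learner errs in every round, yet the returned threshold labels the whole history correctly, and the query points get exponentially close to each other as the rounds progress.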

Our approach to the problem is conceptually different from the smoothed analysis of [13] and the subsequent 
papers. We do not take a particular learning algorithm and study its smoothed complexity. Instead, we ask 
whether there exists an algorithm which guarantees vanishing regret for the smoothed sequences, no matter 
how they are chosen. Using the techniques developed in this paper, learnability is established by directly 
studying the value of the associated game. 

Smoothed analysis of learning has been considered by [B], yet in a different setting. The authors study 
learning DNFs and decision trees over a binary hypercube, where random examples are drawn i.i.d. from 
a product distribution which is itself chosen randomly from a small set. The latter random choice adds an 
element of smoothing to the PAC setting. In contrast, in the present paper we consider adversarially-chosen 
sequences which are then corrupted by random noise. Further, since "probability of error" does not make 
sense for non-stationary data sources, we consider regret as the learnability objective. 

Formally, let $\sigma$ be a fixed "smoothing" distribution defined on some space $S$. The perturbed value of the
adversarial choice $x$ is defined by a measurable mapping $\omega : \mathcal{X} \times S \to \mathcal{X}$, known to the
learner. For example, an additive noise model corresponds to $\omega(x, s) = x + s$. More generally, we can consider a
Markov transition kernel from a space of moves of the adversary to some information space, and the smoothed moves of
the adversary can be thought of as outputs of a noisy communication channel.

A generic smoothed online learning model is given by the following $T$-round interaction between the learner and
the adversary:

On round $t = 1, \ldots, T$,

• the learner chooses a mixed strategy $q_t$ (distribution on $\mathcal{F}$)

• the adversary picks $x_t \in \mathcal{X}$

• random perturbation $s_t \sim \sigma$ is drawn

• the learner draws $f_t \sim q_t$ and pays $f_t(\omega(x_t, s_t))$



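The value of the smoothed online learning game is

$$\mathcal{V}_T = \inf_{q_1} \sup_{x_1} \mathbb{E}_{f_1 \sim q_1,\, s_1 \sim \sigma} \cdots \inf_{q_T} \sup_{x_T} \mathbb{E}_{f_T \sim q_T,\, s_T \sim \sigma}\left[\sum_{t=1}^T f_t(\omega(x_t, s_t)) - \inf_{f \in \mathcal{F}} \sum_{t=1}^T f(\omega(x_t, s_t))\right]$$

where the infima are over $q_t \in \mathcal{Q}$ and the suprema are over $x_t \in \mathcal{X}$. A non-trivial upper bound
on the above value guarantees existence of a strategy for the player that enjoys a regret bound against the smoothed
adversary. We note that both the adversary and the player observe each other's moves and the random
perturbations before proceeding to the next round.

We now observe that the setting is nothing but a special case of a restriction on the adversary, as studied
in this paper. The adversarial choice $x_t$ defines the parameter $x_t$ of the distribution from which a random
element $\omega(x_t, s_t)$ is drawn. The following theorem follows immediately from Theorem 1.

Theorem 23. The value of the smoothed online learning game is bounded above as

$$\mathcal{V}_T \le 2 \sup_{x_1 \in \mathcal{X}} \mathbb{E}_{s_1 \sim \sigma} \mathbb{E}_{\epsilon_1} \cdots \sup_{x_T \in \mathcal{X}} \mathbb{E}_{s_T \sim \sigma} \mathbb{E}_{\epsilon_T}\left[\sup_{f \in \mathcal{F}} \sum_{t=1}^T \epsilon_t f(\omega(x_t, s_t))\right].$$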
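The $T$-round smoothed interaction described earlier in this section can be sketched as a simple simulation loop. Everything below is an illustrative assumption: mixed strategies are represented as samplers, the adversary is given the history and the current mixed strategy, and the instantiation (thresholds, a fixed learner, a constant adversary, uniform noise of width `GAMMA`) is hypothetical.

```python
import random


def play_smoothed_game(learner, adversary, omega, draw_noise, rounds, rng):
    """Simulate the smoothed protocol and return (cumulative loss, history)."""
    history = []
    total_loss = 0.0
    for _ in range(rounds):
        q_t = learner(history)            # mixed strategy, as a sampler rng -> f_t
        x_t = adversary(history, q_t)     # adversary's move, chosen after q_t
        s_t = draw_noise(rng)             # exogenous perturbation s_t ~ sigma
        f_t = q_t(rng)                    # learner's realized move f_t ~ q_t
        total_loss += f_t(omega(x_t, s_t))
        history.append((f_t, x_t, s_t))   # everything is observed before the next round
    return total_loss, history


GAMMA = 0.1  # hypothetical noise width


def threshold(theta):
    # Loss of the threshold theta on the pair (z, y): |y - 1{z < theta}|.
    return lambda zy: abs(zy[1] - (1 if zy[0] < theta else 0))


def fixed_learner(history):
    return lambda rng: threshold(0.5)     # point mass on theta = 0.5


def constant_adversary(history, q_t):
    return (0.5, 1)                       # always plays z = 0.5 with label y = 1


def additive_noise(x, s):
    z, y = x
    return (z + s, y)                     # perturb z only, leave y unchanged


def uniform_noise(rng):
    return rng.uniform(-GAMMA / 2, GAMMA / 2)
```

Replacing `fixed_learner` or `constant_adversary` with adaptive functions of `history` gives the general adaptive interaction.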
We now demonstrate how Theorem 23 can be used to show learnability for a smoothed learning scenario.
What we find is somewhat surprising: for a problem which is not learnable in the online worst-case scenario,
an exponentially small noise added to the moves of the adversary yields a learnable problem. This shows,
at least in the given example, that the worst-case analysis and Littlestone's dimension are brittle notions
which might be too restrictive in the real world, where some noise is unavoidable. It is comforting that small
additive noise makes the problem learnable!



8.1 Binary Classification with Half-Spaces 



Consider the supervised game with threshold functions on a unit interval. The moves of the adversary are
pairs $x = (z, y)$ with $z \in [0, 1]$ and $y \in \{0, 1\}$, and the binary-valued function class $\mathcal{F}$ is
defined by

$$\mathcal{F} = \{f_\theta(z, y) = |y - \mathbf{1}\{z < \theta\}| : \theta \in [0, 1]\}. \qquad (21)$$

The class $\mathcal{F}$ has infinite Littlestone's dimension and is not learnable in the worst-case online framework.
Any non-trivial upper bound on the value of the game, therefore, has to depend on particular noise assumptions.
For the uniform noise $\sigma = \text{Unif}[-\gamma/2, \gamma/2]$ for some $\gamma > 0$, for instance, the intuition
tells us that noise implies a margin. In this case we should expect a $1/\gamma$ complexity parameter appearing in the
bounds. Formally, let

$$\omega((z, y), \sigma) = (z + \sigma, y).$$

That is, $\sigma$ uniformly perturbs the $z$-variable of the adversarial choice $x = (z, y)$, but does not perturb the
$y$-variable. The following proposition holds for this setting.

Proposition 24. For the worst-case adversary whose moves are corrupted by the uniform noise
$\text{Unif}[-\gamma/2, \gamma/2]$, the value is bounded by

$$\mathcal{V}_T \le 2 + \sqrt{2T\left(4\log T + \log(1/\gamma)\right)}.$$



The idea for the proof is the following. By discretizing the interval into bins of size well below the noise
level, we can guarantee with high probability that no two smoothed choices $z_t + s_t$ of the adversary fall into
the same bin. If this is the case, then the supremum of Theorem 23 can be taken over a discretized set
of thresholds. For each fixed threshold $f$, however, $\epsilon_t f(\omega(x_t, s_t))$ forms a martingale difference
sequence, yielding the desired bound. We can easily generalize this idea to linear thresholds in $d$ dimensions: Cover
the sphere corresponding to the choices $z_t$ and $f_t$ by balls of a small enough radius and argue that with high
probability no two smoothed choices of the adversary fall into the same bin. By a simple volume argument,
we claim that the supremum in Theorem 23 can be replaced by the supremum over the discretization at a
small additional cost (the number of bins that change sign as $f$ ranges over one bin). The result then follows
from martingale concentration.

Below, we prove the result for the one-dimensional case, which already exhibits the key ingredients.



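Proof of Proposition 24. For any $f_\theta \in \mathcal{F}$, define

$$M_t^\theta = \epsilon_t f_\theta(\omega(x_t, s_t)) = \epsilon_t\, |y_t - \mathbf{1}\{z_t + s_t < \theta\}|.$$

Note that $\{M_t^\theta\}_t$ is a zero-mean martingale difference sequence, that is
$\mathbb{E}[M_t^\theta \mid z_{1:t}, y_{1:t}, s_{1:t}] = 0$. We conclude that for any fixed $\theta \in [0, 1]$,

$$P\left(\sum_{t=1}^T M_t^\theta \ge \epsilon\right) \le \exp\left\{-\frac{\epsilon^2}{2T}\right\}$$

by Azuma-Hoeffding's inequality. Let $\mathcal{F}' = \{f_{\theta_1}, \ldots, f_{\theta_N}\} \subset \mathcal{F}$ be
obtained by discretizing the interval $[0, 1]$ into $N = T^\alpha$ bins $[\theta_i, \theta_{i+1})$ of length
$T^{-\alpha}$, for some $\alpha > 3$. Then

$$P\left(\max_{f_\theta \in \mathcal{F}'} \sum_{t=1}^T M_t^\theta \ge \epsilon\right) \le N \exp\left\{-\frac{\epsilon^2}{2T}\right\}.$$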



Observe that the maximum over the discretization coincides with the supremum over the class if no two
elements $z_t + s_t$ and $z_{t'} + s_{t'}$ fall into the same interval $[\theta_i, \theta_{i+1})$. Indeed, in this case
all the possible values of $f_\theta$ on the set $\{z_1 + s_1, \ldots, z_T + s_T\}$ are obtained by choosing the discrete
thresholds in $\mathcal{F}'$. Since there are many intervals and we are choosing only $T$ points, the probability of
no collision is close to 1.

Let us calculate the probability that for no distinct $t, t' \in [T]$ do we have $z_t + s_t$ and $z_{t'} + s_{t'}$ in
the same bin. We can deal with the boundary behavior by ensuring that $\mathcal{F}$ is in fact a set of thresholds
that is $\gamma/2$-away from 0 or 1, but we will omit this discussion for the sake of clarity. The probability that no
two elements $z_t + s_t$ and $z_{t'} + s_{t'}$ fall into the same bin depends on the behavior of the adversary in
choosing $z_t$'s. Keeping in mind that the distribution of all $s_t$'s is uniform on $[-\gamma/2, \gamma/2]$, we see
that the probability of a collision is maximized when $z_t$ is chosen to be constant throughout the game.

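If $z_t$'s are all constant throughout the game, we have $T$ balls falling uniformly into $\gamma T^\alpha > T$ bins.
The probability that no two elements $z_t + s_t$ and $z_{t'} + s_{t'}$ fall into the same bin is

$$P(\text{no two balls fall into same bin}) = \frac{\gamma T^\alpha (\gamma T^\alpha - 1) \cdots (\gamma T^\alpha - T + 1)}{\gamma T^\alpha \cdot \gamma T^\alpha \cdots \gamma T^\alpha} \ge \left(\frac{\gamma T^\alpha - T}{\gamma T^\alpha}\right)^T = \left(1 - \frac{1}{\gamma T^{\alpha - 1}}\right)^T.$$

The last term is approximately $\exp\{-1/(\gamma T^{\alpha - 2})\}$ for large $T$, so

$$P(\text{no two balls fall into same bin}) \ge 1 - \frac{1}{\gamma T^{\alpha - 2}}$$

using $e^{-x} \ge 1 - x$. Now,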

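$$P\left(\sup_{f \in \mathcal{F}} \sum_{t=1}^T \epsilon_t f(\omega(x_t, s_t)) \ge \epsilon\right) \le P\left(\sup_{f \in \mathcal{F}} \sum_{t=1}^T \epsilon_t f(\omega(x_t, s_t)) \ge \epsilon \,\wedge\, \text{none of } (z_t + s_t)\text{'s fall into same bin}\right) + P\left(\text{some of } (z_t + s_t)\text{'s fall into same bin}\right)$$

$$= P\left(\max_{f_\theta \in \mathcal{F}'} \sum_{t=1}^T M_t^\theta \ge \epsilon \,\wedge\, \text{none of } (z_t + s_t)\text{'s fall into same bin}\right) + P\left(\text{some of } (z_t + s_t)\text{'s fall into same bin}\right)$$

$$\le P\left(\max_{f_\theta \in \mathcal{F}'} \sum_{t=1}^T M_t^\theta \ge \epsilon\right) + \frac{1}{\gamma T^{\alpha - 2}} \le T^\alpha \exp\left\{-\frac{\epsilon^2}{2T}\right\} + \frac{1}{\gamma T^{\alpha - 2}}.$$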

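Using the above and the fact that for any $f \in \mathcal{F}$, $\left|\sum_{t=1}^T \epsilon_t f(\omega(x_t, s_t))\right| \le T$
we can conclude that

$$\mathcal{V}_T \le \mathbb{E}\left[\sup_{f \in \mathcal{F}} \sum_{t=1}^T \epsilon_t f(\omega(x_t, s_t))\right] \le \epsilon + T^{\alpha + 1} \exp\left\{-\frac{\epsilon^2}{2T}\right\} + \frac{T^{3 - \alpha}}{\gamma}.$$

Setting $\epsilon = \sqrt{2(\alpha + 1) T \log T}$ we conclude that

$$\mathcal{V}_T \le 1 + \sqrt{2(\alpha + 1) T \log T} + \frac{T^{3 - \alpha}}{\gamma}.$$

Now pick $\alpha = 3 + \frac{\log(1/\gamma)}{\log T}$ (this choice is fine because
$\gamma T^{\alpha - 1} = \gamma T^{2 + \log(1/\gamma)/\log T} = T^2$ grows with $T$ as needed for the
previous approximation). Hence we see that

$$\mathcal{V}_T \le 1 + \sqrt{2(\alpha + 1) T \log T} + \frac{T^{3 - \alpha}}{\gamma} = 2 + \sqrt{2T\left(4\log T + \log(1/\gamma)\right)}.$$

□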
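The bins-and-balls step of the proof is easy to check numerically. The sketch below compares the exact probability that $T$ balls thrown uniformly into a given number of bins land in distinct bins with the lower bound $1 - 1/(\gamma T^{\alpha - 2})$ as read from the proof. The function names are ours, and the bin count $\gamma T^\alpha$ is assumed to be an integer in the usage example.

```python
def no_collision_probability(num_balls, num_bins):
    """Exact probability that num_balls uniform throws hit distinct bins."""
    p = 1.0
    for k in range(num_balls):
        p *= (num_bins - k) / num_bins
    return p


def collision_free_lower_bound(T, gamma, alpha):
    """Lower bound 1 - 1 / (gamma * T^(alpha - 2)) used in the proof."""
    return 1.0 - 1.0 / (gamma * T ** (alpha - 2))
```

For instance, with $T = 10$, $\gamma = 0.5$ and $\alpha = 3$ there are 500 bins, and `no_collision_probability(10, 500)` exceeds `collision_free_lower_bound(10, 0.5, 3)`, which equals 0.8.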

While the infinite Littlestone dimension of threshold functions seemed to indicate that half spaces are not
online learnable, the analysis shows that very slight perturbations (in fact even exponentially small in $T$)
are enough to make half spaces online learnable, so in practice half spaces can be used for classification in
the smoothed online setting.
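To get a feel for how small the noise can be, the bound of Proposition 24 can be evaluated directly. The helper below takes $\log(1/\gamma)$ as its argument so that very small $\gamma$ does not underflow; the name and the example noise level $\gamma = e^{-\sqrt{T}}$ are our own illustrative choices, not taken from the paper.

```python
import math


def smoothed_threshold_bound(T, log_inv_gamma):
    """Evaluate 2 + sqrt(2 T (4 log T + log(1/gamma))) from Proposition 24."""
    return 2.0 + math.sqrt(2.0 * T * (4.0 * math.log(T) + log_inv_gamma))


def per_round_bound(T, log_inv_gamma):
    """Bound on the average regret per round."""
    return smoothed_threshold_bound(T, log_inv_gamma) / T
```

With $\log(1/\gamma) = \sqrt{T}$, the per-round bound shrinks as $T$ grows (roughly 0.77, 0.17 and 0.05 for $T = 10^2, 10^4, 10^6$), so the regret bound is sublinear even though $\gamma$ decays faster than any polynomial in $T$.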

We note that our learnability analysis was based on an upper bound on the value of the game. The inefficient
algorithm can be recovered from the minimax formulation directly. However, for the particular problem of
smoothed learning with half-spaces, the exponential weights algorithm on the discretization of the interval
will also do the job. An alternative analysis can directly focus on this algorithm and use the same
bins-and-balls proof to show that the loss of any expert is likely to be close to the loss of any non-discretized
threshold.
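A minimal sketch of exponential weights over a discretized set of thresholds is given below. It is only an illustration of the idea mentioned above: the grid of bin midpoints, the learning rate, and the synthetic data in the usage note are assumptions of this sketch, and no claim is made that it matches the tuning needed for the bound of Proposition 24.

```python
import math
import random


def threshold_loss(theta, z, y):
    """Loss |y - 1{z < theta}| of the threshold theta on the example (z, y)."""
    return abs(y - (1 if z < theta else 0))


def exponential_weights_thresholds(examples, num_bins, eta, rng):
    """Run exponential weights with one expert per bin midpoint.

    Returns the learner's cumulative loss and the loss of the best expert.
    """
    thetas = [(i + 0.5) / num_bins for i in range(num_bins)]
    cumulative = [0.0] * num_bins
    learner_loss = 0
    for z, y in examples:
        shift = min(cumulative)           # shift exponents for numerical stability
        weights = [math.exp(-eta * (c - shift)) for c in cumulative]
        theta = rng.choices(thetas, weights=weights, k=1)[0]
        learner_loss += threshold_loss(theta, z, y)
        for i, th in enumerate(thetas):
            cumulative[i] += threshold_loss(th, z, y)
    return learner_loss, min(cumulative)
```

On a synthetic stream labeled by a fixed threshold, with `eta` set to `sqrt(8 * log(num_bins) / T)`, the learner's loss stays close to that of the best grid point.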
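Appendix

Proof of Theorem 7. The proof is identical to that in [1]. For simplicity, denote
$\psi(x_{1:T}) = \inf_{f \in \mathcal{F}} \sum_{t=1}^T f(x_t)$. The first step in the proof is to appeal to the minimax
theorem for every couple of inf and sup:

$$\inf_{q_1 \in \mathcal{Q}} \sup_{p_1 \in \mathcal{P}_1} \mathbb{E}_{f_1 \sim q_1,\, x_1 \sim p_1} \cdots \inf_{q_T \in \mathcal{Q}} \sup_{p_T \in \mathcal{P}_T} \mathbb{E}_{f_T \sim q_T,\, x_T \sim p_T}\left[\sum_{t=1}^T f_t(x_t) - \psi(x_{1:T})\right]$$

$$= \sup_{p_1 \in \mathcal{P}_1} \inf_{q_1 \in \mathcal{Q}} \mathbb{E}_{f_1 \sim q_1,\, x_1 \sim p_1} \cdots \sup_{p_T \in \mathcal{P}_T} \inf_{q_T \in \mathcal{Q}} \mathbb{E}_{f_T \sim q_T,\, x_T \sim p_T}\left[\sum_{t=1}^T f_t(x_t) - \psi(x_{1:T})\right]$$

$$= \sup_{p_1} \inf_{f_1} \mathbb{E}_{x_1 \sim p_1} \cdots \sup_{p_T} \inf_{f_T} \mathbb{E}_{x_T \sim p_T}\left[\sum_{t=1}^T f_t(x_t) - \psi(x_{1:T})\right]$$

From now on, it will be understood that $x_t$ has distribution $p_t$ and that the suprema over $p_t$ are in fact over
$p_t \in \mathcal{P}_t(x_{1:t-1})$. By moving the expectation with respect to $x_T$ and then the infimum with respect to
$f_T$ inside the expression, we arrive at

$$\sup_{p_1} \inf_{f_1} \mathbb{E}_{x_1} \cdots \sup_{p_{T-1}} \inf_{f_{T-1}} \mathbb{E}_{x_{T-1}} \sup_{p_T}\left[\sum_{t=1}^{T-1} f_t(x_t) + \inf_{f_T} \mathbb{E}_{x_T} f_T(x_T) - \mathbb{E}_{x_T} \psi(x_{1:T})\right]$$

$$= \sup_{p_1} \inf_{f_1} \mathbb{E}_{x_1} \cdots \sup_{p_{T-1}} \inf_{f_{T-1}} \mathbb{E}_{x_{T-1}} \sup_{p_T} \mathbb{E}_{x_T}\left[\sum_{t=1}^{T-1} f_t(x_t) + \inf_{f_T} \mathbb{E}_{x_T} f_T(x_T) - \psi(x_{1:T})\right]$$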
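Let us now repeat the procedure for step $T-1$. The above expression is equal to

\begin{align*}
&\sup_{p_1} \inf_{f_1} \mathbb{E}_{x_1} \cdots \sup_{p_{T-1}} \inf_{f_{T-1}} \mathbb{E}_{x_{T-1}} \left[ \sum_{t=1}^{T-1} f_t(x_t) + \sup_{p_T} \mathbb{E}_{x_T} \left[ \inf_{f_T} \mathbb{E}_{x_T} f_T(x_T) - \psi(x_{1:T}) \right] \right] \\
&= \sup_{p_1} \inf_{f_1} \mathbb{E}_{x_1} \cdots \sup_{p_{T-1}} \left[ \sum_{t=1}^{T-2} f_t(x_t) + \inf_{f_{T-1}} \mathbb{E}_{x_{T-1}} f_{T-1}(x_{T-1}) + \mathbb{E}_{x_{T-1}} \sup_{p_T} \mathbb{E}_{x_T} \left[ \inf_{f_T} \mathbb{E}_{x_T} f_T(x_T) - \psi(x_{1:T}) \right] \right] \\
&= \sup_{p_1} \inf_{f_1} \mathbb{E}_{x_1} \cdots \sup_{p_{T-1}} \mathbb{E}_{x_{T-1}} \sup_{p_T} \mathbb{E}_{x_T} \left[ \sum_{t=1}^{T-2} f_t(x_t) + \inf_{f_{T-1}} \mathbb{E}_{x_{T-1}} f_{T-1}(x_{T-1}) + \inf_{f_T} \mathbb{E}_{x_T} f_T(x_T) - \psi(x_{1:T}) \right].
\end{align*}

Continuing in this fashion for $T-2$ and all the way down to $t = 1$ proves the theorem.

□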



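Proof of Proposition 2. Fix an oblivious strategy $\mathbf{p}$ and note that $\mathcal{V}_T(\mathcal{P}_{1:T}) \ge \mathcal{V}_T^{\mathbf{p}}$. From now on, it will
be understood that $x_t$ has distribution $p_t(\cdot\mid x_{1:t-1})$. Let $\pi = \{\pi_t\}_{t=1}^T$ be a strategy of the player, that is, a
sequence of mappings $\pi_t : (\mathcal{F}\times\mathcal{X})^{t-1} \to \mathcal{Q}$.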

By moving to a functional representation in Eq. ([o]), 



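\[
\mathcal{V}_T^{\mathbf{p}} = \inf_{\pi}\ \mathbb{E}_{f_1\sim\pi_1} \mathbb{E}_{x_1\sim p_1} \cdots \mathbb{E}_{f_T\sim\pi_T(\cdot\mid f_{1:T-1},\, x_{1:T-1})} \mathbb{E}_{x_T\sim p_T(\cdot\mid x_{1:T-1})} \left[ \sum_{t=1}^T f_t(x_t) - \inf_{f\in\mathcal{F}} \sum_{t=1}^T f(x_t) \right].
\]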



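Note that the last term does not depend on $f_1, \ldots, f_T$, and so the expression above is equal to

\[
\inf_{\pi} \left\{ \mathbb{E}_{f_1\sim\pi_1} \mathbb{E}_{x_1\sim p_1} \cdots \mathbb{E}_{f_T\sim\pi_T(\cdot\mid f_{1:T-1},\, x_{1:T-1})} \mathbb{E}_{x_T\sim p_T(\cdot\mid x_{1:T-1})} \left[ \sum_{t=1}^T f_t(x_t) \right] - \mathbb{E} \left[ \inf_{f\in\mathcal{F}} \sum_{t=1}^T f(x_t) \right] \right\},
\]

where the last expectation is over $x_{1:T}$ only. Now, by linearity of expectation, the first term can be written as

\begin{align*}
&\sum_{t=1}^T \mathbb{E}_{f_1\sim\pi_1} \mathbb{E}_{x_1\sim p_1} \cdots \mathbb{E}_{f_t\sim\pi_t(\cdot\mid f_{1:t-1},\, x_{1:t-1})} \mathbb{E}_{x_t\sim p_t(\cdot\mid x_{1:t-1})} f_t(x_t) \\
&= \sum_{t=1}^T \mathbb{E}_{x_1\sim p_1} \cdots \mathbb{E}_{x_t\sim p_t(\cdot\mid x_{1:t-1})} \left[ \mathbb{E}_{f_1\sim\pi_1} \cdots \mathbb{E}_{f_t\sim\pi_t(\cdot\mid f_{1:t-1},\, x_{1:t-1})} f_t(x_t) \right],
\end{align*}

so that

\[
\mathcal{V}_T^{\mathbf{p}} = \inf_{\pi} \left\{ \sum_{t=1}^T \mathbb{E}_{x_1\sim p_1} \cdots \mathbb{E}_{x_t\sim p_t(\cdot\mid x_{1:t-1})} \left[ \mathbb{E}_{f_1\sim\pi_1} \cdots \mathbb{E}_{f_t\sim\pi_t(\cdot\mid f_{1:t-1},\, x_{1:t-1})} f_t(x_t) \right] - \mathbb{E} \left[ \inf_{f\in\mathcal{F}} \sum_{t=1}^T f(x_t) \right] \right\}. \tag{22}
\]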



Now notice that for any strategy $\pi = \{\pi_t\}_{t=1}^T$, there is an equivalent strategy $\pi' = \{\pi'_t\}_{t=1}^T$ that (a) gives
the same value to the above expression as $\pi$ and (b) does not depend on the past decisions of the player,
that is $\pi'_t : \mathcal{X}^{t-1} \to \mathcal{Q}$. To see why this is the case, fix any strategy $\pi$ and for any $t$ define

\[
\pi'_t(\cdot\mid x_{1:t-1}) = \mathbb{E}_{f_1\sim\pi_1} \cdots \mathbb{E}_{f_{t-1}\sim\pi_{t-1}(\cdot\mid f_{1:t-2},\, x_{1:t-2})}\ \pi_t(\cdot\mid f_{1:t-1},\, x_{1:t-1}),
\]

where we integrated out the sequence $f_1, \ldots, f_{t-1}$. Then

\[
\mathbb{E}_{f_1\sim\pi_1} \cdots \mathbb{E}_{f_t\sim\pi_t(\cdot\mid f_{1:t-1},\, x_{1:t-1})} f_t(x_t) = \mathbb{E}_{f_t\sim\pi'_t(\cdot\mid x_{1:t-1})} f_t(x_t),
\]

and so $\pi$ and $\pi'$ give the same value in (22).



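We conclude that the infimum in (22) can be restricted to those strategies $\pi$ that do not depend on past
randomizations of the player. In this case,

\begin{align*}
\mathcal{V}_T^{\mathbf{p}} &= \inf_{\pi} \left\{ \sum_{t=1}^T \mathbb{E}_{x_1\sim p_1} \cdots \mathbb{E}_{x_t\sim p_t(\cdot\mid x_{1:t-1})} \mathbb{E}_{f_t\sim\pi_t(\cdot\mid x_{1:t-1})} f_t(x_t) - \mathbb{E} \left[ \inf_{f\in\mathcal{F}} \sum_{t=1}^T f(x_t) \right] \right\} \\
&= \inf_{\pi}\ \mathbb{E} \left[ \sum_{t=1}^T \mathbb{E}_{f_t\sim\pi_t(\cdot\mid x_{1:t-1})} f_t(x_t) - \inf_{f\in\mathcal{F}} \sum_{t=1}^T f(x_t) \right].
\end{align*}

Now, notice that we can choose the Bayes optimal response $f_t$ in each term:

\begin{align*}
\mathcal{V}_T^{\mathbf{p}} &= \inf_{\pi}\ \mathbb{E} \left[ \sum_{t=1}^T \mathbb{E}_{f_t\sim\pi_t(\cdot\mid x_{1:t-1})} f_t(x_t) - \inf_{f\in\mathcal{F}} \sum_{t=1}^T f(x_t) \right] \\
&\ge \mathbb{E} \left[ \sum_{t=1}^T \inf_{f_t\in\mathcal{F}} \mathbb{E}_{x_t\sim p_t} f_t(x_t) - \inf_{f\in\mathcal{F}} \sum_{t=1}^T f(x_t) \right].
\end{align*}

Together with Theorem 1, this implies that

\[
\mathcal{V}_T^{\mathbf{p}^*} = \mathcal{V}_T(\mathcal{P}_{1:T}) = \inf_{\pi}\ \mathbb{E} \left[ \sum_{t=1}^T \mathbb{E}_{f_t\sim\pi_t(\cdot\mid x_{1:t-1})} f_t(x_t) - \inf_{f\in\mathcal{F}} \sum_{t=1}^T f(x_t) \right]
\]

for any $\mathbf{p}^*$ achieving the supremum in (8). Further, the infimum is over strategies that do not depend on the
moves of the player.

We conclude that there is an oblivious minimax optimal strategy of the adversary, and there is a corresponding
minimax optimal strategy for the player that does not depend on its own moves.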

□ 
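The "Bayes optimal response" step can be illustrated on a finite toy instance. The following sketch is an assumption-laden example rather than anything from the paper: the class is a short list of functions on a two-point domain, the conditional law of $x_t$ is given as a dictionary, and the helper names `expected_loss`, `bayes_response` and `mixed_loss` are hypothetical. It shows that no randomized choice of $f_t$ can beat the deterministic minimizer of the conditional expected loss.

```python
from fractions import Fraction


def expected_loss(f, cond_dist):
    """Expected value of f(x) when x is drawn from cond_dist (dict x -> probability)."""
    return sum(p * f[x] for x, p in cond_dist.items())


def bayes_response(functions, cond_dist):
    """Index and value of the function minimizing the conditional expected loss."""
    losses = [expected_loss(f, cond_dist) for f in functions]
    best = min(range(len(functions)), key=lambda i: losses[i])
    return best, losses[best]


def mixed_loss(mixture, functions, cond_dist):
    """Expected loss of a randomized strategy (probabilities over the functions)."""
    return sum(q * expected_loss(f, cond_dist) for q, f in zip(mixture, functions))


if __name__ == "__main__":
    functions = [
        {0: Fraction(1), 1: Fraction(0)},
        {0: Fraction(0), 1: Fraction(1)},
        {0: Fraction(1, 2), 1: Fraction(1, 2)},
    ]
    cond_dist = {0: Fraction(1, 4), 1: Fraction(3, 4)}
    uniform = [Fraction(1, 3)] * 3
    print(bayes_response(functions, cond_dist))        # index of minimizer and its loss
    print(mixed_loss(uniform, functions, cond_dist))   # never smaller than the Bayes value
```

Because the expected loss of a mixture is a convex combination of the individual expected losses, it is bounded below by the smallest of them, which is exactly why the infimum over strategies can be replaced by a term-by-term infimum over $f_t$.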



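Proof of Theorem 3. From Theorem 1,

\begin{align}
\mathcal{V}_T &= \sup_{\mathbf{p}\in\mathfrak{P}} \mathbb{E} \left[ \sum_{t=1}^T \inf_{f_t\in\mathcal{F}} \mathbb{E}_{t-1}[f_t(x_t)] - \inf_{f\in\mathcal{F}} \sum_{t=1}^T f(x_t) \right] \notag \\
&= \sup_{\mathbf{p}\in\mathfrak{P}} \mathbb{E} \left[ \sup_{f\in\mathcal{F}} \left\{ \sum_{t=1}^T \inf_{f_t\in\mathcal{F}} \mathbb{E}_{t-1}[f_t(x_t)] - f(x_t) \right\} \right] \notag \\
&\le \sup_{\mathbf{p}\in\mathfrak{P}} \mathbb{E} \left[ \sup_{f\in\mathcal{F}} \left\{ \sum_{t=1}^T \mathbb{E}_{t-1}[f(x_t)] - f(x_t) \right\} \right]. \tag{23}
\end{align}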



The upper bound is obtained by replacing each infimum by a particular choice $f$. Note that $\mathbb{E}_{t-1}[f(x_t)] - f(x_t)$
is a martingale difference sequence. We now employ a symmetrization technique. For this purpose,
we introduce a tangent sequence $\{x'_t\}_{t=1}^T$ that is constructed as follows. Let $x'_1$ be an independent copy of
$x_1$. For $t \ge 2$, let $x'_t$ be both identically distributed as $x_t$ as well as independent of it conditioned on $x_{1:t-1}$.
Then, we have, for any $t \in [T]$ and $f \in \mathcal{F}$,

\[
\mathbb{E}_{t-1}[f(x_t)] = \mathbb{E}_{t-1}[f(x'_t)] = \mathbb{E}_T[f(x'_t)]. \tag{24}
\]

The first equality is true by construction. The second holds because $x'_t$ is independent of $x_{t:T}$ conditioned
on $x_{1:t-1}$. We also have, for any $t \in [T]$ and $f \in \mathcal{F}$,

\[
f(x_t) = \mathbb{E}_T[f(x_t)]. \tag{25}
\]



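Plugging in (24) and (25) into (23), we get

\begin{align*}
\mathcal{V}_T &\le \sup_{\mathbf{p}\in\mathfrak{P}} \mathbb{E} \left[ \sup_{f\in\mathcal{F}} \left\{ \sum_{t=1}^T \mathbb{E}_T[f(x'_t)] - \mathbb{E}_T[f(x_t)] \right\} \right] \\
&= \sup_{\mathbf{p}\in\mathfrak{P}} \mathbb{E} \left[ \sup_{f\in\mathcal{F}} \left\{ \mathbb{E}_T \left[ \sum_{t=1}^T f(x'_t) - f(x_t) \right] \right\} \right] \\
&\le \sup_{\mathbf{p}\in\mathfrak{P}} \mathbb{E} \left[ \sup_{f\in\mathcal{F}} \left\{ \sum_{t=1}^T f(x'_t) - f(x_t) \right\} \right].
\end{align*}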



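For any $\mathbf{p}$, the expectation in the above supremum can be written as

\[
\mathbb{E} \left[ \sup_{f\in\mathcal{F}} \left\{ \sum_{t=1}^T f(x'_t) - f(x_t) \right\} \right]
= \mathbb{E}_{x_1,x'_1\sim p_1} \mathbb{E}_{x_2,x'_2\sim p_2(\cdot\mid x_1)} \cdots \mathbb{E}_{x_T,x'_T\sim p_T(\cdot\mid x_{1:T-1})} \left[ \sup_{f\in\mathcal{F}} \left\{ \sum_{t=1}^T f(x'_t) - f(x_t) \right\} \right].
\]

Now, let's see what happens when we rename $x_1$ and $x'_1$ in the right-hand side of the above equality. The
equivalent expression we then obtain is

\[
\mathbb{E}_{x'_1,x_1\sim p_1} \mathbb{E}_{x_2,x'_2\sim p_2(\cdot\mid x'_1)} \mathbb{E}_{x_3,x'_3\sim p_3(\cdot\mid x'_1, x_2)} \cdots \mathbb{E}_{x_T,x'_T\sim p_T(\cdot\mid x'_1, x_{2:T-1})} \left[ \sup_{f\in\mathcal{F}} \left\{ -\big(f(x'_1) - f(x_1)\big) + \sum_{t=2}^T f(x'_t) - f(x_t) \right\} \right].
\]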



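Now fix any $\epsilon \in \{\pm 1\}^T$. Informally, $\epsilon_t = 1$ indicates that we rename $x_t$ and $x'_t$. It is not hard to verify
that

\begin{align}
&\mathbb{E}_{x_1,x'_1\sim p_1} \mathbb{E}_{x_2,x'_2\sim p_2(\cdot\mid x_1)} \cdots \mathbb{E}_{x_T,x'_T\sim p_T(\cdot\mid x_1,\ldots,x_{T-1})} \left[ \sup_{f\in\mathcal{F}} \left\{ \sum_{t=1}^T f(x'_t) - f(x_t) \right\} \right] \notag \\
&= \mathbb{E}_{x_1,x'_1\sim p_1} \mathbb{E}_{x_2,x'_2\sim p_2(\cdot\mid \chi_1(-1))} \cdots \mathbb{E}_{x_T,x'_T\sim p_T(\cdot\mid \chi_1(-1),\ldots,\chi_{T-1}(-1))} \left[ \sup_{f\in\mathcal{F}} \left\{ \sum_{t=1}^T f(x'_t) - f(x_t) \right\} \right] \tag{26} \\
&= \mathbb{E}_{x_1,x'_1\sim p_1} \mathbb{E}_{x_2,x'_2\sim p_2(\cdot\mid \chi_1(\epsilon_1))} \cdots \mathbb{E}_{x_T,x'_T\sim p_T(\cdot\mid \chi_1(\epsilon_1),\ldots,\chi_{T-1}(\epsilon_{T-1}))} \left[ \sup_{f\in\mathcal{F}} \left\{ \sum_{t=1}^T -\epsilon_t \big(f(x'_t) - f(x_t)\big) \right\} \right]. \tag{27}
\end{align}

Since the equalities (26) and (27) hold for any $\epsilon \in \{\pm 1\}^T$, we conclude that

\begin{align}
&\mathbb{E} \left[ \sup_{f\in\mathcal{F}} \left\{ \sum_{t=1}^T f(x'_t) - f(x_t) \right\} \right] \notag \\
&= \mathbb{E}_{\epsilon} \mathbb{E}_{x_1,x'_1\sim p_1} \mathbb{E}_{x_2,x'_2\sim p_2(\cdot\mid \chi_1(\epsilon_1))} \cdots \mathbb{E}_{x_T,x'_T\sim p_T(\cdot\mid \chi_1(\epsilon_1),\ldots,\chi_{T-1}(\epsilon_{T-1}))} \left[ \sup_{f\in\mathcal{F}} \left\{ \sum_{t=1}^T -\epsilon_t \big(f(x'_t) - f(x_t)\big) \right\} \right] \notag \\
&= \mathbb{E}_{x_1,x'_1\sim p_1} \mathbb{E}_{\epsilon_1} \mathbb{E}_{x_2,x'_2\sim p_2(\cdot\mid \chi_1(\epsilon_1))} \mathbb{E}_{\epsilon_2} \cdots \mathbb{E}_{x_T,x'_T\sim p_T(\cdot\mid \chi_1(\epsilon_1),\ldots,\chi_{T-1}(\epsilon_{T-1}))} \mathbb{E}_{\epsilon_T} \left[ \sup_{f\in\mathcal{F}} \left\{ \sum_{t=1}^T -\epsilon_t \big(f(x'_t) - f(x_t)\big) \right\} \right]. \tag{28}
\end{align}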
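The process above can be thought of as taking a path in a binary tree. At each step $t$, a coin is flipped
and this determines whether $x_t$ or $x'_t$ is to be used in conditional distributions in the following steps. This
is precisely the process outlined in (12). Using the definition of $\boldsymbol{\rho}$, we can rewrite the last expression in
Eq. (28) as

\[
\mathbb{E}_{(x_1,x'_1)\sim\boldsymbol{\rho}_1(\epsilon)} \mathbb{E}_{\epsilon_1} \cdots \mathbb{E}_{\epsilon_{T-1}} \mathbb{E}_{(x_T,x'_T)\sim\boldsymbol{\rho}_T(\epsilon)((x_1,x'_1),\ldots,(x_{T-1},x'_{T-1}))} \mathbb{E}_{\epsilon_T} \left[ \sup_{f\in\mathcal{F}} \left\{ \sum_{t=1}^T \epsilon_t \big(f(x_t) - f(x'_t)\big) \right\} \right].
\]

More succinctly, Eq. (28) can be written as

\[
\mathbb{E}_{(\mathbf{x},\mathbf{x}')\sim\boldsymbol{\rho}} \mathbb{E}_{\epsilon} \left[ \sup_{f\in\mathcal{F}} \left\{ \sum_{t=1}^T \epsilon_t \big(f(\mathbf{x}_t(\epsilon)) - f(\mathbf{x}'_t(\epsilon))\big) \right\} \right]. \tag{29}
\]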



It is worth emphasizing that the values of the mappings $\mathbf{x}, \mathbf{x}'$ are drawn conditionally independently; however,
the distribution depends on the ancestors in both trees. In some sense, the path $\epsilon$ defines "who is tangent to
whom".

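We now split the supremum into two:

\begin{align}
&\mathbb{E}_{(\mathbf{x},\mathbf{x}')\sim\boldsymbol{\rho}} \mathbb{E}_{\epsilon} \left[ \sup_{f\in\mathcal{F}} \left\{ \sum_{t=1}^T \epsilon_t \big(f(\mathbf{x}_t(\epsilon)) - f(\mathbf{x}'_t(\epsilon))\big) \right\} \right] \notag \\
&\le \mathbb{E}_{(\mathbf{x},\mathbf{x}')\sim\boldsymbol{\rho}} \mathbb{E}_{\epsilon} \left[ \sup_{f\in\mathcal{F}} \sum_{t=1}^T \epsilon_t f(\mathbf{x}_t(\epsilon)) \right] + \mathbb{E}_{(\mathbf{x},\mathbf{x}')\sim\boldsymbol{\rho}} \mathbb{E}_{\epsilon} \left[ \sup_{f\in\mathcal{F}} \sum_{t=1}^T -\epsilon_t f(\mathbf{x}'_t(\epsilon)) \right] \tag{30} \\
&= 2\, \mathbb{E}_{(\mathbf{x},\mathbf{x}')\sim\boldsymbol{\rho}} \mathbb{E}_{\epsilon} \left[ \sup_{f\in\mathcal{F}} \sum_{t=1}^T \epsilon_t f(\mathbf{x}_t(\epsilon)) \right]. \notag
\end{align}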



The last equality is not difficult to verify but requires understanding the symmetry between the paths in
the $\mathbf{x}$ and $\mathbf{x}'$ trees. This symmetry implies that the two terms in Eq. (30) are equal. Each $\epsilon \in \{\pm 1\}^T$ in
the first term defines time steps $t$ when values in $\mathbf{x}$ are used in conditional distributions. To any such $\epsilon$,
there corresponds a $-\epsilon$ in the second term which defines times when values in $\mathbf{x}'$ are used in conditional
distributions. This implies the required result. As a more concrete example, consider the path $\epsilon = -\mathbf{1}$ in
the first term. The contribution to the overall expectation is the supremum over $f \in \mathcal{F}$ of evaluation of $-f$
on the left-most path of the $\mathbf{x}$ tree which is defined as successive draws from distributions $p_t$ conditioned on
the values on the left-most path, irrespective of the $\mathbf{x}'$ tree. Now consider the corresponding path $\epsilon = \mathbf{1}$ in
the second term. Its contribution to the overall expectation is a supremum over $f \in \mathcal{F}$ of evaluation of $-f$
on the right-most path of the $\mathbf{x}'$ tree, defined as successive draws from distributions $p_t$ conditioned on the
values on the right-most path, irrespective of the $\mathbf{x}$ tree. Clearly, the contributions are the same, and the
same argument can be done for any path $\epsilon$.



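Alternatively, we can see that the two terms in Eq. (30) are equal by expanding the notation. We thus claim
that

\begin{align*}
&\mathbb{E}_{x_1,x'_1\sim p_1} \mathbb{E}_{\epsilon_1} \mathbb{E}_{x_2,x'_2\sim p_2(\cdot\mid \chi_1(\epsilon_1))} \mathbb{E}_{\epsilon_2} \cdots \mathbb{E}_{x_T,x'_T\sim p_T(\cdot\mid \chi_1(\epsilon_1),\ldots,\chi_{T-1}(\epsilon_{T-1}))} \mathbb{E}_{\epsilon_T} \left[ \sup_{f\in\mathcal{F}} \left\{ \sum_{t=1}^T -\epsilon_t f(x'_t) \right\} \right] \\
&= \mathbb{E}_{x_1,x'_1\sim p_1} \mathbb{E}_{\epsilon_1} \mathbb{E}_{x_2,x'_2\sim p_2(\cdot\mid \chi_1(\epsilon_1))} \mathbb{E}_{\epsilon_2} \cdots \mathbb{E}_{x_T,x'_T\sim p_T(\cdot\mid \chi_1(\epsilon_1),\ldots,\chi_{T-1}(\epsilon_{T-1}))} \mathbb{E}_{\epsilon_T} \left[ \sup_{f\in\mathcal{F}} \left\{ \sum_{t=1}^T \epsilon_t f(x_t) \right\} \right].
\end{align*}

The identity can be verified by simultaneously renaming $x$ with $x'$ and $\epsilon$ with $-\epsilon$. Since $\chi(x, x', \epsilon) =
\chi(x', x, -\epsilon)$, the distributions in the two expressions are the same while the sum of the first term becomes
the sum of the second term.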
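The renaming argument can be checked by exact enumeration on a small finite instance. The sketch below is illustrative only and rests on assumptions not made in the text: $T = 2$, a finite domain, a conditional law for the second round that depends only on the selected first-round value, and a finite class given as a list of dictionaries. The names `chi` and `two_terms` are hypothetical. The selector is taken to return $x'$ when $\epsilon = +1$ and $x$ when $\epsilon = -1$.

```python
from fractions import Fraction
from itertools import product


def chi(x, x_prime, eps):
    """Selector: x' if eps == +1, x if eps == -1 (so chi(x, x', e) == chi(x', x, -e))."""
    return x_prime if eps == 1 else x


def two_terms(p1, p2, functions):
    """Exact values of the two terms of the split, for T = 2.

    p1: dict x -> probability for round 1.
    p2: dict z -> (dict x -> probability), law of round 2 given the selected value z.
    functions: list of dicts x -> value (a finite class).
    """
    half = Fraction(1, 2)
    first = Fraction(0)
    second = Fraction(0)
    for (x1, a1), (x1p, b1), e1 in product(p1.items(), p1.items(), (-1, 1)):
        z = chi(x1, x1p, e1)  # value used in the conditional law of round 2
        for (x2, a2), (x2p, b2), e2 in product(p2[z].items(), p2[z].items(), (-1, 1)):
            w = a1 * b1 * half * a2 * b2 * half
            first += w * max(e1 * f[x1] + e2 * f[x2] for f in functions)
            second += w * max(-e1 * f[x1p] - e2 * f[x2p] for f in functions)
    return first, second


if __name__ == "__main__":
    p1 = {0: Fraction(1, 4), 1: Fraction(3, 4)}
    p2 = {0: {0: Fraction(1, 3), 1: Fraction(2, 3)},
          1: {0: Fraction(4, 5), 1: Fraction(1, 5)}}
    functions = [{0: 0, 1: 1}, {0: 1, 1: -1}, {0: 2, 1: 0}]
    print(two_terms(p1, p2, functions))
```

Both returned values coincide exactly (rational arithmetic is used to avoid rounding), which is the $T = 2$ instance of the claimed identity.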
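More generally, the split of Eq. (30) can be performed via an additional "centering" term. For any $t$, let $M_t$
be a function with the property $M_t(\mathbf{p}, f, \mathbf{x}, \mathbf{x}', \epsilon) = M_t(\mathbf{p}, f, \mathbf{x}', \mathbf{x}, -\epsilon)$.
We then have

\begin{align}
&\mathbb{E}_{(\mathbf{x},\mathbf{x}')\sim\boldsymbol{\rho}} \mathbb{E}_{\epsilon} \left[ \sup_{f\in\mathcal{F}} \left\{ \sum_{t=1}^T \epsilon_t \big(f(\mathbf{x}_t(\epsilon)) - f(\mathbf{x}'_t(\epsilon))\big) \right\} \right] \notag \\
&\le \mathbb{E}_{(\mathbf{x},\mathbf{x}')\sim\boldsymbol{\rho}} \mathbb{E}_{\epsilon} \left[ \sup_{f\in\mathcal{F}} \sum_{t=1}^T \epsilon_t \big(f(\mathbf{x}_t(\epsilon)) - M_t(\mathbf{p}, f, \mathbf{x}, \mathbf{x}', \epsilon)\big) \right]
 + \mathbb{E}_{(\mathbf{x},\mathbf{x}')\sim\boldsymbol{\rho}} \mathbb{E}_{\epsilon} \left[ \sup_{f\in\mathcal{F}} \sum_{t=1}^T -\epsilon_t \big(f(\mathbf{x}'_t(\epsilon)) - M_t(\mathbf{p}, f, \mathbf{x}, \mathbf{x}', \epsilon)\big) \right] \tag{31} \\
&= 2\, \mathbb{E}_{(\mathbf{x},\mathbf{x}')\sim\boldsymbol{\rho}} \mathbb{E}_{\epsilon} \left[ \sup_{f\in\mathcal{F}} \sum_{t=1}^T \epsilon_t \big(f(\mathbf{x}_t(\epsilon)) - M_t(\mathbf{p}, f, \mathbf{x}, \mathbf{x}', \epsilon)\big) \right]. \notag
\end{align}

To verify equality of the two terms in (31) we can expand the notation:

\begin{align*}
&\mathbb{E}_{x_1,x'_1\sim p_1} \mathbb{E}_{\epsilon_1} \mathbb{E}_{x_2,x'_2\sim p_2(\cdot\mid \chi_1(\epsilon_1))} \mathbb{E}_{\epsilon_2} \cdots \mathbb{E}_{x_T,x'_T\sim p_T(\cdot\mid \chi_1(\epsilon_1),\ldots,\chi_{T-1}(\epsilon_{T-1}))} \mathbb{E}_{\epsilon_T} \left[ \sup_{f\in\mathcal{F}} \left\{ \sum_{t=1}^T -\epsilon_t \big(f(x'_t) - M_t(\mathbf{p}, f, \mathbf{x}, \mathbf{x}', \epsilon)\big) \right\} \right] \\
&= \mathbb{E}_{x_1,x'_1\sim p_1} \mathbb{E}_{\epsilon_1} \mathbb{E}_{x_2,x'_2\sim p_2(\cdot\mid \chi_1(\epsilon_1))} \mathbb{E}_{\epsilon_2} \cdots \mathbb{E}_{x_T,x'_T\sim p_T(\cdot\mid \chi_1(\epsilon_1),\ldots,\chi_{T-1}(\epsilon_{T-1}))} \mathbb{E}_{\epsilon_T} \left[ \sup_{f\in\mathcal{F}} \left\{ \sum_{t=1}^T \epsilon_t \big(f(x_t) - M_t(\mathbf{p}, f, \mathbf{x}, \mathbf{x}', \epsilon)\big) \right\} \right].
\end{align*}

□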



Proof of Corollary^ Define a function Mt as the conditional expectation 

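\[
M_t(\mathbf{p}, f, \mathbf{x}, \mathbf{x}', \epsilon) = \mathbb{E}_{x\sim p_t(\cdot\mid \chi_1(\epsilon_1),\ldots,\chi_{t-1}(\epsilon_{t-1}))} f(x).
\]
The property $M_t(\mathbf{p}, f, \mathbf{x}, \mathbf{x}', \epsilon) = M_t(\mathbf{p}, f, \mathbf{x}', \mathbf{x}, -\epsilon)$ holds because $\chi(x, x', \epsilon) = \chi(x', x, -\epsilon)$.

□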

Proof of Corollary The first steps follow the proof of Theorem [3] 



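\[
\mathcal{V}_T \le \sup_{\mathbf{p}\in\mathfrak{P}} \mathbb{E} \left[ \sup_{f\in\mathcal{F}} \left\{ \sum_{t=1}^T f(x'_t) - f(x_t) \right\} \right]
\]

and for a fixed $\mathbf{p} \in \mathfrak{P}$,

\begin{align}
&\mathbb{E} \left[ \sup_{f\in\mathcal{F}} \left\{ \sum_{t=1}^T f(x'_t) - f(x_t) \right\} \right] \notag \\
&= \mathbb{E}_{x_1,x'_1\sim p_1} \mathbb{E}_{\epsilon_1} \mathbb{E}_{x_2,x'_2\sim p_2(\cdot\mid \chi_1(\epsilon_1))} \mathbb{E}_{\epsilon_2} \cdots \mathbb{E}_{x_T,x'_T\sim p_T(\cdot\mid \chi_1(\epsilon_1),\ldots,\chi_{T-1}(\epsilon_{T-1}))} \mathbb{E}_{\epsilon_T} \left[ \sup_{f\in\mathcal{F}} \left\{ \sum_{t=1}^T -\epsilon_t \big(f(x'_t) - f(x_t)\big) \right\} \right]. \tag{32}
\end{align}

At this point we pass to an upper bound, unlike the proof of Theorem 3. Notice that $p_t(\cdot\mid \chi_1(\epsilon_1),\ldots,\chi_{t-1}(\epsilon_{t-1}))$
is a distribution with support in $\mathcal{X}_t(\chi_1(\epsilon_1),\ldots,\chi_{t-1}(\epsilon_{t-1}))$. That is, the sequence $\chi_1(\epsilon_1),\ldots,\chi_{t-1}(\epsilon_{t-1})$
defines the constraint at time $t$. Passing from $t = T$ down to $t = 1$, we can replace all the expectations over $p_t$
by the suprema over the set $\mathcal{X}_t$, only increasing the value:

\begin{align*}
&\mathbb{E} \left[ \sup_{f\in\mathcal{F}} \left\{ \sum_{t=1}^T f(x'_t) - f(x_t) \right\} \right] \\
&\le \sup_{x_1,x'_1\in\mathcal{X}_1} \mathbb{E}_{\epsilon_1} \sup_{x_2,x'_2\in\mathcal{X}_2(\chi_1(\epsilon_1))} \mathbb{E}_{\epsilon_2} \cdots \sup_{x_T,x'_T\in\mathcal{X}_T(\chi_1(\epsilon_1),\ldots,\chi_{T-1}(\epsilon_{T-1}))} \mathbb{E}_{\epsilon_T} \left[ \sup_{f\in\mathcal{F}} \left\{ \sum_{t=1}^T -\epsilon_t \big(f(x'_t) - f(x_t)\big) \right\} \right] \\
&= \sup_{(\mathbf{x},\mathbf{x}')\in\mathcal{T}} \mathbb{E}_{\epsilon} \left[ \sup_{f\in\mathcal{F}} \left\{ \sum_{t=1}^T -\epsilon_t \big(f(\mathbf{x}'_t(\epsilon)) - f(\mathbf{x}_t(\epsilon))\big) \right\} \right].
\end{align*}

In the last equality, we passed to the tree representation. Indeed, at each step, we are choosing $x_t, x'_t$ from
the appropriate set and then flipping a coin $\epsilon_t$ which decides which of $x_t, x'_t$ will be used to define the
constraint set through $\chi_t(\epsilon_t)$. This once again defines a tree structure and we may pass to the supremum
over trees $(\mathbf{x}, \mathbf{x}') \in \mathcal{T}$. However, $\mathcal{T}$ is not a set of all possible $\mathcal{X}$-valued trees: for each $t$,
$\mathbf{x}_t(\epsilon), \mathbf{x}'_t(\epsilon) \in \mathcal{X}_t\big(\chi_1(\mathbf{x}_1, \mathbf{x}'_1, \epsilon_1), \ldots, \chi_{t-1}(\mathbf{x}_{t-1}(\epsilon), \mathbf{x}'_{t-1}(\epsilon), \epsilon_{t-1})\big)$. That is, the choice at each node of the tree is
constrained by the values of both trees according to the path. As before, the left-most path of the $\mathbf{x}$ tree (as
well as the right-most path of the $\mathbf{x}'$ tree) is defined by constraints applied to the values on the path only
disregarding the other tree.

The rest of the proof exactly follows the proof of Theorem 3. □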

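Proof of Proposition 12. Let $M_t(f, \mathbf{x}, \mathbf{x}', \epsilon) = \frac{1}{t-1} \sum_{\tau=1}^{t-1} f(\chi_\tau(\epsilon_\tau))$. Note that since $\chi(x, x', \epsilon) = \chi(x', x, -\epsilon)$
we have that $M_t(f, \mathbf{x}, \mathbf{x}', \epsilon) = M_t(f, \mathbf{x}', \mathbf{x}, -\epsilon)$. Using 11 we conclude that

\begin{align*}
\mathcal{V}_T &\le 2 \sup_{(\mathbf{x},\mathbf{x}')\in\mathcal{T}} \mathbb{E}_{\epsilon} \left[ \sup_{f\in\mathcal{F}} \sum_{t=1}^T \epsilon_t \left( \langle f, \mathbf{x}_t(\epsilon) \rangle - \frac{1}{t-1} \sum_{\tau=1}^{t-1} \langle f, \chi_\tau(\epsilon_\tau) \rangle \right) \right] \\
&= 2 \sup_{(\mathbf{x},\mathbf{x}')\in\mathcal{T}} \mathbb{E}_{\epsilon} \left[ \sup_{f\in\mathcal{F}} \left\langle f,\ \sum_{t=1}^T \epsilon_t \left( \mathbf{x}_t(\epsilon) - \frac{1}{t-1} \sum_{\tau=1}^{t-1} \chi_\tau(\epsilon_\tau) \right) \right\rangle \right].
\end{align*}

By linearity and Fenchel's inequality, the last expression is upper bounded by

\begin{align}
&\frac{2}{\alpha} \sup_{(\mathbf{x},\mathbf{x}')\in\mathcal{T}} \mathbb{E}_{\epsilon} \left[ \sup_{f\in\mathcal{F}} \left\langle f,\ \alpha \sum_{t=1}^T \epsilon_t \left( \mathbf{x}_t(\epsilon) - \frac{1}{t-1} \sum_{\tau=1}^{t-1} \chi_\tau(\epsilon_\tau) \right) \right\rangle \right] \notag \\
&\le \frac{2}{\alpha} \sup_{(\mathbf{x},\mathbf{x}')\in\mathcal{T}} \mathbb{E}_{\epsilon} \left[ \sup_{f\in\mathcal{F}} \left\{ \Psi(f) + \Psi^*\!\left( \alpha \sum_{t=1}^T \epsilon_t \left( \mathbf{x}_t(\epsilon) - \frac{1}{t-1} \sum_{\tau=1}^{t-1} \chi_\tau(\epsilon_\tau) \right) \right) \right\} \right] \notag \\
&\le \frac{2}{\alpha} \left( \sup_{f\in\mathcal{F}} \Psi(f) + \sup_{(\mathbf{x},\mathbf{x}')\in\mathcal{T}} \mathbb{E}_{\epsilon} \left[ \Psi^*\!\left( \alpha \sum_{t=1}^T \epsilon_t \left( \mathbf{x}_t(\epsilon) - \frac{1}{t-1} \sum_{\tau=1}^{t-1} \chi_\tau(\epsilon_\tau) \right) \right) \right] \right) \notag \\
&\le \frac{2R^2}{\alpha} + \frac{2}{\alpha} \sup_{(\mathbf{x},\mathbf{x}')\in\mathcal{T}} \mathbb{E}_{\epsilon} \left[ \Psi^*\!\left( \alpha \sum_{t=1}^T \epsilon_t \left( \mathbf{x}_t(\epsilon) - \frac{1}{t-1} \sum_{\tau=1}^{t-1} \chi_\tau(\epsilon_\tau) \right) \right) \right] \notag \\
&\le \frac{2R^2}{\alpha} + \alpha \sup_{(\mathbf{x},\mathbf{x}')\in\mathcal{T}} \mathbb{E}_{\epsilon} \left[ \sum_{t=1}^T \left\| \mathbf{x}_t(\epsilon) - \frac{1}{t-1} \sum_{\tau=1}^{t-1} \chi_\tau(\epsilon_\tau) \right\|_*^2 \right], \tag{33}
\end{align}

where the last step follows from Lemma 2 of [5] (with a slight modification). However, since $(\mathbf{x}, \mathbf{x}') \in \mathcal{T}$ are
pairs of trees such that for any $\epsilon \in \{\pm 1\}^T$ and any $t \in [T]$,

\[
C(\chi_1(\epsilon_1), \ldots, \chi_{t-1}(\epsilon_{t-1}), \mathbf{x}_t(\epsilon)) = 1,
\]

we can conclude that for any $\epsilon \in \{\pm 1\}^T$ and any $t \in [T]$,

\[
\left\| \mathbf{x}_t(\epsilon) - \frac{1}{t-1} \sum_{\tau=1}^{t-1} \chi_\tau(\epsilon_\tau) \right\|_* \le \sigma_t.
\]

Using this with Equation (33) and the fact that $\alpha$ is arbitrary, we can conclude that

\[
\mathcal{V}_T \le \inf_{\alpha > 0} \left\{ \frac{2R^2}{\alpha} + \alpha \sum_{t=1}^T \sigma_t^2 \right\} \le 2\sqrt{2}\, R \sqrt{\sum_{t=1}^T \sigma_t^2}.
\]

□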



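Proof of Proposition 13. Let $M_t(f, \mathbf{x}, \mathbf{x}', \epsilon) = f(\chi_{t-1}(\epsilon_{t-1}))$. Note that since $\chi(x, x', \epsilon) = \chi(x', x, -\epsilon)$ we
have that $M_t(f, \mathbf{x}, \mathbf{x}', \epsilon) = M_t(f, \mathbf{x}', \mathbf{x}, -\epsilon)$. Using 11 we conclude that

\begin{align*}
\mathcal{V}_T &\le 2 \sup_{(\mathbf{x},\mathbf{x}')\in\mathcal{T}} \mathbb{E}_{\epsilon} \left[ \sup_{f\in\mathcal{F}} \sum_{t=1}^T \epsilon_t \big( \langle f, \mathbf{x}_t(\epsilon) \rangle - \langle f, \chi_{t-1}(\epsilon_{t-1}) \rangle \big) \right] \\
&= 2 \sup_{(\mathbf{x},\mathbf{x}')\in\mathcal{T}} \mathbb{E}_{\epsilon} \left[ \sup_{f\in\mathcal{F}} \left\langle f,\ \sum_{t=1}^T \epsilon_t \big( \mathbf{x}_t(\epsilon) - \chi_{t-1}(\epsilon_{t-1}) \big) \right\rangle \right].
\end{align*}

As before, using linearity and Fenchel's inequality we pass to the upper bound

\begin{align}
&\frac{2}{\alpha} \sup_{(\mathbf{x},\mathbf{x}')\in\mathcal{T}} \mathbb{E}_{\epsilon} \left[ \sup_{f\in\mathcal{F}} \left\langle f,\ \alpha \sum_{t=1}^T \epsilon_t \big( \mathbf{x}_t(\epsilon) - \chi_{t-1}(\epsilon_{t-1}) \big) \right\rangle \right] \notag \\
&\le \frac{2}{\alpha} \sup_{(\mathbf{x},\mathbf{x}')\in\mathcal{T}} \mathbb{E}_{\epsilon} \left[ \sup_{f\in\mathcal{F}} \left\{ \Psi(f) + \Psi^*\!\left( \alpha \sum_{t=1}^T \epsilon_t \big( \mathbf{x}_t(\epsilon) - \chi_{t-1}(\epsilon_{t-1}) \big) \right) \right\} \right] \notag \\
&\le \frac{2}{\alpha} \left( \sup_{f\in\mathcal{F}} \Psi(f) + \sup_{(\mathbf{x},\mathbf{x}')\in\mathcal{T}} \mathbb{E}_{\epsilon} \left[ \Psi^*\!\left( \alpha \sum_{t=1}^T \epsilon_t \big( \mathbf{x}_t(\epsilon) - \chi_{t-1}(\epsilon_{t-1}) \big) \right) \right] \right) \notag \\
&\le \frac{2R^2}{\alpha} + \frac{2}{\alpha} \sup_{(\mathbf{x},\mathbf{x}')\in\mathcal{T}} \mathbb{E}_{\epsilon} \left[ \Psi^*\!\left( \alpha \sum_{t=1}^T \epsilon_t \big( \mathbf{x}_t(\epsilon) - \chi_{t-1}(\epsilon_{t-1}) \big) \right) \right] \notag \\
&\le \frac{2R^2}{\alpha} + \alpha \sup_{(\mathbf{x},\mathbf{x}')\in\mathcal{T}} \mathbb{E}_{\epsilon} \left[ \sum_{t=1}^T \big\| \mathbf{x}_t(\epsilon) - \chi_{t-1}(\epsilon_{t-1}) \big\|_*^2 \right], \tag{34}
\end{align}

where the last step follows from Lemma 2 of [5] (with a slight modification). However, since $(\mathbf{x}, \mathbf{x}') \in \mathcal{T}$ are
pairs of trees such that for any $\epsilon \in \{\pm 1\}^T$ and any $t \in [T]$,

\[
C(\chi_1(\epsilon_1), \ldots, \chi_{t-1}(\epsilon_{t-1}), \mathbf{x}_t(\epsilon)) = 1,
\]

we can conclude that for any $\epsilon \in \{\pm 1\}^T$ and any $t \in [T]$,

\[
\big\| \mathbf{x}_t(\epsilon) - \chi_{t-1}(\epsilon_{t-1}) \big\|_* \le \delta.
\]

Using this with Equation (34) and the fact that $\alpha$ is arbitrary, we can conclude that

\[
\mathcal{V}_T \le \inf_{\alpha > 0} \left\{ \frac{2R^2}{\alpha} + \alpha\, \delta^2 T \right\} \le 2R\delta\sqrt{2T}.
\]

□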
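For completeness, the optimization over $\alpha$ used at the end of the last two proofs can be worked out explicitly. The derivation below is a sketch under the assumption that the bound has the form $2R^2/\alpha + \alpha S$ with $S > 0$; the symbol $S$ is introduced here only for illustration and stands for the quantity multiplying $\alpha$ (for instance $\delta^2 T$ in the case of the constraint $\|\mathbf{x}_t(\epsilon) - \chi_{t-1}(\epsilon_{t-1})\| \le \delta$).

```latex
\[
g(\alpha) = \frac{2R^2}{\alpha} + \alpha S, \qquad
g'(\alpha) = -\frac{2R^2}{\alpha^2} + S = 0
\;\Longrightarrow\; \alpha^\star = R\sqrt{\frac{2}{S}},
\]
\[
g(\alpha^\star) = \frac{2R^2}{R\sqrt{2/S}} + R\sqrt{\frac{2}{S}}\, S
= R\sqrt{2S} + R\sqrt{2S} = 2R\sqrt{2S}.
\]
```

With $S = \delta^2 T$ this gives $2R\delta\sqrt{2T}$, and with $S = \sum_{t=1}^T \sigma_t^2$ it gives $2\sqrt{2}\,R\,\big(\sum_{t=1}^T \sigma_t^2\big)^{1/2}$.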
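Proof of Lemma 17. We want to bound the supremum (as $\mathbf{p}$ ranges over $\mathfrak{P}$) of the distribution-dependent
Rademacher complexity:

\[
\sup_{\mathbf{p}\in\mathfrak{P}} \mathfrak{R}_T(\phi(\mathcal{F}), \mathbf{p}) = \sup_{\mathbf{p}\in\mathfrak{P}} \mathbb{E}_{((\mathbf{x},\mathbf{y}),(\mathbf{x}',\mathbf{y}'))\sim\boldsymbol{\rho}} \mathbb{E}_{\epsilon} \left[ \sup_{f\in\mathcal{F}} \sum_{t=1}^T \epsilon_t \phi(f(\mathbf{x}_t(\epsilon)), \mathbf{y}_t(\epsilon)) \right]
\]

for an associated process $\boldsymbol{\rho}$ defined in Section 3. To elucidate the random process $\boldsymbol{\rho}$, we expand the succinct
tree notation and write the above quantity as

\[
\sup_{\mathbf{p}}\ \mathbb{E}_{x_1,x'_1\sim p}\, \mathbb{E}_{\substack{y_1\sim p_1(\cdot\mid x_1) \\ y'_1\sim p_1(\cdot\mid x'_1)}} \mathbb{E}_{\epsilon_1}\, \mathbb{E}_{x_2,x'_2\sim p}\, \mathbb{E}_{\substack{y_2\sim p_2(\cdot\mid \chi_1(\epsilon_1), x_2) \\ y'_2\sim p_2(\cdot\mid \chi_1(\epsilon_1), x'_2)}} \mathbb{E}_{\epsilon_2} \cdots \mathbb{E}_{x_T,x'_T\sim p}\, \mathbb{E}_{\substack{y_T\sim p_T(\cdot\mid \chi_1(\epsilon_1),\ldots,\chi_{T-1}(\epsilon_{T-1}), x_T) \\ y'_T\sim p_T(\cdot\mid \chi_1(\epsilon_1),\ldots,\chi_{T-1}(\epsilon_{T-1}), x'_T)}} \mathbb{E}_{\epsilon_T} \left[ \sup_{f\in\mathcal{F}} \sum_{t=1}^T \epsilon_t \phi(f(x_t), y_t) \right]
\]

where $\chi_t(\epsilon_t)$ now selects the pair $(x_t, y_t)$ or $(x'_t, y'_t)$. By passing to the supremum over $y_t, y'_t$ for all $t$, we
arrive at

\begin{align*}
\sup_{\mathbf{p}\in\mathfrak{P}} \mathfrak{R}_T(\phi(\mathcal{F}), \mathbf{p})
&\le \mathbb{E}_{x_1,x'_1\sim p} \sup_{y_1,y'_1} \mathbb{E}_{\epsilon_1}\, \mathbb{E}_{x_2,x'_2\sim p} \sup_{y_2,y'_2} \mathbb{E}_{\epsilon_2} \cdots \mathbb{E}_{x_T,x'_T\sim p} \sup_{y_T,y'_T} \mathbb{E}_{\epsilon_T} \left[ \sup_{f\in\mathcal{F}} \sum_{t=1}^T \epsilon_t \phi(f(x_t), y_t) \right] \\
&= \mathbb{E}_{x_1\sim p} \sup_{y_1} \mathbb{E}_{\epsilon_1}\, \mathbb{E}_{x_2\sim p} \sup_{y_2} \mathbb{E}_{\epsilon_2} \cdots \mathbb{E}_{x_T\sim p} \sup_{y_T} \mathbb{E}_{\epsilon_T} \left[ \sup_{f\in\mathcal{F}} \sum_{t=1}^T \epsilon_t \phi(f(x_t), y_t) \right]
\end{align*}

where the sequence of $x'_t$'s and $y'_t$'s has been eliminated. By moving the expectations over $x_t$'s outside the
suprema (and thus increasing the value), we upper bound the above by:

\begin{align*}
&\le \mathbb{E}_{x_1,\ldots,x_T\sim p} \sup_{y_1} \mathbb{E}_{\epsilon_1} \sup_{y_2} \mathbb{E}_{\epsilon_2} \cdots \sup_{y_T} \mathbb{E}_{\epsilon_T} \left[ \sup_{f\in\mathcal{F}} \sum_{t=1}^T \epsilon_t \phi(f(x_t), y_t) \right] \\
&= \mathbb{E}_{x_1,\ldots,x_T\sim p} \sup_{\mathbf{y}} \mathbb{E}_{\epsilon} \left[ \sup_{f\in\mathcal{F}} \sum_{t=1}^T \epsilon_t \phi(f(x_t), \mathbf{y}_t(\epsilon)) \right].
\end{align*}

□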



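Proof of Lemma 18. First, without loss of generality assume $L = 1$. The general case follows from this by
simply scaling $\phi$ appropriately. By Lemma 17,

\[
\mathfrak{R}_T(\phi(\mathcal{F}), \mathbf{p}) \le \mathbb{E}_{x_1,\ldots,x_T\sim p} \sup_{\mathbf{y}} \mathbb{E}_{\epsilon} \left[ \sup_{f\in\mathcal{F}} \sum_{t=1}^T \epsilon_t \phi(f(x_t), \mathbf{y}_t(\epsilon)) \right]. \tag{35}
\]

The proof proceeds by sequentially using the Lipschitz property of $\phi(f(x_t), \mathbf{y}_t(\epsilon))$ for decreasing $t$, starting
from $t = T$. Towards this end, define

\[
R_t = \mathbb{E}_{x_1,\ldots,x_T\sim p} \sup_{\mathbf{y}} \mathbb{E}_{\epsilon} \left[ \sup_{f\in\mathcal{F}} \left\{ \sum_{s=1}^{t} \epsilon_s \phi(f(x_s), \mathbf{y}_s(\epsilon)) + \sum_{s=t+1}^{T} \epsilon_s f(x_s) \right\} \right].
\]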



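Since the mappings $\mathbf{y}_{t+1}, \ldots, \mathbf{y}_T$ do not enter the expression, the supremum is in fact taken over the trees $\mathbf{y}$ of
depth $t$. Note that $R_0 = \mathfrak{R}(\mathcal{F}, p)$ is precisely the classical Rademacher complexity (without the dependence
on $\mathbf{y}$), while $R_T$ is the upper bound on $\mathfrak{R}_T(\phi(\mathcal{F}), \mathbf{p})$ in Eq. (35). We need to show $R_T \le R_0$ and we will
show this by proving $R_t \le R_{t-1}$ for all $t \in [T]$. So, let us fix $t \in [T]$ and start with $R_t$:

\begin{align*}
R_t &= \mathbb{E}_{x_1,\ldots,x_T\sim p} \sup_{\mathbf{y}} \mathbb{E}_{\epsilon} \left[ \sup_{f\in\mathcal{F}} \left\{ \sum_{s=1}^{t} \epsilon_s \phi(f(x_s), \mathbf{y}_s(\epsilon)) + \sum_{s=t+1}^{T} \epsilon_s f(x_s) \right\} \right] \\
&= \mathbb{E}_{x_1,\ldots,x_T\sim p} \sup_{y_1} \mathbb{E}_{\epsilon_1} \cdots \sup_{y_t} \mathbb{E}_{\epsilon_t} \mathbb{E}_{\epsilon_{t+1:T}} \left[ \sup_{f\in\mathcal{F}} \left\{ \sum_{s=1}^{t} \epsilon_s \phi(f(x_s), y_s) + \sum_{s=t+1}^{T} \epsilon_s f(x_s) \right\} \right] \\
&= \mathbb{E}_{x_1,\ldots,x_T\sim p} \sup_{y_1} \mathbb{E}_{\epsilon_1} \cdots \sup_{y_t} \mathbb{E}_{\epsilon_{t+1:T}}\, S(x_{1:T}, y_{1:t}, \epsilon_{1:t-1}, \epsilon_{t+1:T})
\end{align*}

with

\begin{align*}
S(x_{1:T}, y_{1:t}, \epsilon_{1:t-1}, \epsilon_{t+1:T})
&= \mathbb{E}_{\epsilon_t} \left[ \sup_{f\in\mathcal{F}} \left\{ \sum_{s=1}^{t} \epsilon_s \phi(f(x_s), y_s) + \sum_{s=t+1}^{T} \epsilon_s f(x_s) \right\} \right] \\
&= \frac{1}{2} \sup_{f\in\mathcal{F}} \left\{ \sum_{s=1}^{t-1} \epsilon_s \phi(f(x_s), y_s) + \phi(f(x_t), y_t) + \sum_{s=t+1}^{T} \epsilon_s f(x_s) \right\} \\
&\quad + \frac{1}{2} \sup_{f\in\mathcal{F}} \left\{ \sum_{s=1}^{t-1} \epsilon_s \phi(f(x_s), y_s) - \phi(f(x_t), y_t) + \sum_{s=t+1}^{T} \epsilon_s f(x_s) \right\}.
\end{align*}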

The two suprema can be combined to yield

\begin{align*}
&2S(x_{1:T}, y_{1:t}, \epsilon_{1:t-1}, \epsilon_{t+1:T}) \\
&= \sup_{f,g\in\mathcal{F}} \left\{ \sum_{s=1}^{t-1} \epsilon_s \big(\phi(f(x_s), y_s) + \phi(g(x_s), y_s)\big) + \phi(f(x_t), y_t) - \phi(g(x_t), y_t) + \sum_{s=t+1}^{T} \epsilon_s \big(f(x_s) + g(x_s)\big) \right\} \\
&\le \sup_{f,g\in\mathcal{F}} \left\{ \sum_{s=1}^{t-1} \epsilon_s \big(\phi(f(x_s), y_s) + \phi(g(x_s), y_s)\big) + |f(x_t) - g(x_t)| + \sum_{s=t+1}^{T} \epsilon_s \big(f(x_s) + g(x_s)\big) \right\} \qquad (*) \\
&= \sup_{f,g\in\mathcal{F}} \left\{ \sum_{s=1}^{t-1} \epsilon_s \big(\phi(f(x_s), y_s) + \phi(g(x_s), y_s)\big) + f(x_t) - g(x_t) + \sum_{s=t+1}^{T} \epsilon_s \big(f(x_s) + g(x_s)\big) \right\}. \qquad (**)
\end{align*}

The first inequality is due to the Lipschitz property, while the last equality needs a justification. First, it
is clear that the term (**) is upper bounded by (*). The reverse direction can be argued as follows. Let a
pair $(f^*, g^*)$ achieve the supremum in (*). Suppose first that $f^*(x_t) \ge g^*(x_t)$. Then $(f^*, g^*)$ provides the
same value in (**) and, hence, the supremum in (**) is no less than the supremum in (*). If, on the other hand,
$f^*(x_t) < g^*(x_t)$, then the pair $(g^*, f^*)$ provides the same value in (**).

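We conclude that

\begin{align*}
&S(x_{1:T}, y_{1:t}, \epsilon_{1:t-1}, \epsilon_{t+1:T}) \\
&\le \frac{1}{2} \sup_{f,g\in\mathcal{F}} \left\{ \sum_{s=1}^{t-1} \epsilon_s \big(\phi(f(x_s), y_s) + \phi(g(x_s), y_s)\big) + f(x_t) - g(x_t) + \sum_{s=t+1}^{T} \epsilon_s \big(f(x_s) + g(x_s)\big) \right\} \\
&= \frac{1}{2} \sup_{f\in\mathcal{F}} \left\{ \sum_{s=1}^{t-1} \epsilon_s \phi(f(x_s), y_s) + f(x_t) + \sum_{s=t+1}^{T} \epsilon_s f(x_s) \right\} + \frac{1}{2} \sup_{f\in\mathcal{F}} \left\{ \sum_{s=1}^{t-1} \epsilon_s \phi(f(x_s), y_s) - f(x_t) + \sum_{s=t+1}^{T} \epsilon_s f(x_s) \right\} \\
&= \mathbb{E}_{\epsilon_t} \sup_{f\in\mathcal{F}} \left\{ \sum_{s=1}^{t-1} \epsilon_s \phi(f(x_s), y_s) + \epsilon_t f(x_t) + \sum_{s=t+1}^{T} \epsilon_s f(x_s) \right\}.
\end{align*}

Thus,

\begin{align*}
R_t &= \mathbb{E}_{x_1,\ldots,x_T\sim p} \sup_{y_1} \mathbb{E}_{\epsilon_1} \cdots \sup_{y_t} \mathbb{E}_{\epsilon_{t+1:T}}\, S(x_{1:T}, y_{1:t}, \epsilon_{1:t-1}, \epsilon_{t+1:T}) \\
&\le \mathbb{E}_{x_1,\ldots,x_T\sim p} \sup_{y_1} \mathbb{E}_{\epsilon_1} \cdots \sup_{y_t} \mathbb{E}_{\epsilon_t} \mathbb{E}_{\epsilon_{t+1:T}} \sup_{f\in\mathcal{F}} \left\{ \sum_{s=1}^{t-1} \epsilon_s \phi(f(x_s), y_s) + \sum_{s=t}^{T} \epsilon_s f(x_s) \right\} \\
&= \mathbb{E}_{x_1,\ldots,x_T\sim p} \sup_{y_1} \mathbb{E}_{\epsilon_1} \cdots \sup_{y_{t-1}} \mathbb{E}_{\epsilon_{t-1}} \mathbb{E}_{\epsilon_{t:T}} \sup_{f\in\mathcal{F}} \left\{ \sum_{s=1}^{t-1} \epsilon_s \phi(f(x_s), y_s) + \sum_{s=t}^{T} \epsilon_s f(x_s) \right\} = R_{t-1},
\end{align*}

where we have removed the supremum over $y_t$ as it no longer appears in the objective. This concludes the
proof.

□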
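The single contraction step at the heart of this argument can be checked numerically on finite classes. The sketch below is illustrative and uses hypothetical names (`one_step`, `contraction_gap`); it assumes a finite class represented by its values $v_f = f(x_t)$ together with arbitrary offsets $c_f$ collecting the remaining terms, and a $1$-Lipschitz map such as $v \mapsto |y - v|$. It compares $\mathbb{E}_{\epsilon}\sup_f (c_f + \epsilon\,\phi(v_f))$ with $\mathbb{E}_{\epsilon}\sup_f (c_f + \epsilon\, v_f)$.

```python
import random


def one_step(cs, vals):
    """E over eps in {-1, +1} of max_f (c_f + eps * v_f) for a finite class."""
    plus = max(c + v for c, v in zip(cs, vals))
    minus = max(c - v for c, v in zip(cs, vals))
    return 0.5 * (plus + minus)


def contraction_gap(cs, fx, phi):
    """Value without phi minus value with phi applied; non-negative if phi is 1-Lipschitz."""
    return one_step(cs, fx) - one_step(cs, [phi(v) for v in fx])


if __name__ == "__main__":
    random.seed(1)
    cs = [random.uniform(-2, 2) for _ in range(5)]   # offsets from the other rounds
    fx = [random.uniform(-1, 1) for _ in range(5)]   # values f(x_t) over the class
    y = 1
    print(contraction_gap(cs, fx, lambda v: abs(y - v)))
```

The printed gap is non-negative, in line with the combined-suprema argument: the difference $\phi(v_f) - \phi(v_g)$ is bounded by $|v_f - v_g|$, and the absolute value can be dropped by symmetry in $(f, g)$.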



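Proof of Lemma 20. Notice that $\mathbf{p}$ defines the stochastic process $\boldsymbol{\rho}$ as in (12) where the i.i.d. $y_t$'s now
play the role of the $\epsilon_t$'s. More precisely, at each time $t$ two copies $x_t$ and $x'_t$ are drawn from the marginal
distribution $p_t(\cdot\mid \chi_1(y_1), \ldots, \chi_{t-1}(y_{t-1}))$, then a Rademacher random variable $y_t$ is drawn i.i.d. and it
indicates whether $x_t$ or $x'_t$ is to be used in the subsequent conditional distributions via the selector $\chi_t(y_t)$. This
is a well-defined process obtained from $\mathbf{p}$ that produces a sequence $(x_1, x'_1, y_1), \ldots, (x_T, x'_T, y_T)$. The $x'$
sequence is only used to define conditional distributions below, while the sequence $(x_1, y_1), \ldots, (x_T, y_T)$ is
presented to the player. Since restrictions are history-independent, the stochastic process is following the
protocol which defines $\mathfrak{P}$.

For any $\mathbf{p}$ of the form described above, the value of the game in (7) can be lower-bounded via Proposition 2:

\[
\mathcal{V}_T^{\mathrm{sup}} \ge \mathbb{E} \left[ \sum_{t=1}^T \inf_{f_t\in\mathcal{F}} \mathbb{E}_{(x_t,y_t)} \Big[ |y_t - f_t(x_t)| \,\Big|\, (x,y)_{1:t-1} \Big] - \inf_{f\in\mathcal{F}} \sum_{t=1}^T |y_t - f(x_t)| \right].
\]

A short calculation shows that the last quantity is equal to

\[
\mathbb{E} \sup_{f\in\mathcal{F}} \sum_{t=1}^T \big(1 - |y_t - f(x_t)|\big) = \mathbb{E} \sup_{f\in\mathcal{F}} \sum_{t=1}^T y_t f(x_t).
\]

The last expectation can be expanded to show the stochastic process:

\begin{align*}
&\mathbb{E}_{x_1,x'_1\sim p_1} \mathbb{E}_{y_1} \mathbb{E}_{x_2,x'_2\sim p_2(\cdot\mid \chi_1(y_1))} \mathbb{E}_{y_2} \cdots \mathbb{E}_{x_T,x'_T\sim p_T(\cdot\mid \chi_1(y_1),\ldots,\chi_{T-1}(y_{T-1}))} \mathbb{E}_{y_T} \sup_{f\in\mathcal{F}} \sum_{t=1}^T y_t f(x_t) \\
&= \mathbb{E}_{(\mathbf{x},\mathbf{x}')\sim\boldsymbol{\rho}} \mathbb{E}_{\epsilon} \left[ \sup_{f\in\mathcal{F}} \sum_{t=1}^T \epsilon_t f(\mathbf{x}_t(\epsilon)) \right] \\
&= \mathfrak{R}_T(\mathcal{F}, \mathbf{p}).
\end{align*}

Since this lower bound holds for any $\mathbf{p}$ which allows the labels to be independent $\pm 1$ with probability $1/2$,
we conclude the proof. □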
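The "short calculation" rests on the identity $|y - a| = 1 - ya$ for $y \in \{\pm 1\}$ and $a \in [-1, 1]$. The following sketch checks its consequence by exact enumeration over all label sequences, for a small finite class and fixed points $x_1, \ldots, x_T$. It is an illustration under these assumptions, and the function name `lower_bound_two_ways` is hypothetical.

```python
from fractions import Fraction
from itertools import product


def lower_bound_two_ways(function_values):
    """Compare E_y[T - min_f sum_t |y_t - f(x_t)|] with E_y[max_f sum_t y_t f(x_t)].

    function_values: list of tuples (f(x_1), ..., f(x_T)) with entries in [-1, 1];
    the labels y_t are independent uniform signs.
    """
    T = len(function_values[0])
    weight = Fraction(1, 2 ** T)
    lhs = Fraction(0)
    rhs = Fraction(0)
    for ys in product((-1, 1), repeat=T):
        best_loss = min(sum(abs(y - v) for y, v in zip(ys, f)) for f in function_values)
        lhs += weight * (T - best_loss)
        rhs += weight * max(sum(y * v for y, v in zip(ys, f)) for f in function_values)
    return lhs, rhs


if __name__ == "__main__":
    values = [
        (Fraction(1, 2), Fraction(-1, 3), Fraction(1)),
        (Fraction(0), Fraction(1), Fraction(-3, 4)),
        (Fraction(-1), Fraction(-1), Fraction(-1)),
    ]
    print(lower_bound_two_ways(values))
```

Both quantities agree exactly, since for each label sequence $T - \sum_t |y_t - f(x_t)| = \sum_t y_t f(x_t)$ and the minimum over $f$ of the loss corresponds to the maximum of the correlation.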
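Proof of Lemma 21. For the purposes of this proof, the adversary presents $y_t$ as an i.i.d. Rademacher
random variable on each round. Unlike the previous lemma, only the $\{x_t\}$ sequence is used for defining
conditional distributions. Hence, the $\mathbf{x}'$ tree is immaterial and the lower bound is only concerned with the
left-most path. The rest of the proof is similar to that of Lemma 20:

\[
\mathcal{V}_T^{\mathrm{sup}} \ge \mathbb{E} \left[ \sum_{t=1}^T \inf_{f_t\in\mathcal{F}} \mathbb{E}_{(x_t,y_t)} \Big[ |y_t - f_t(x_t)| \,\Big|\, (x,y)_{1:t-1} \Big] - \inf_{f\in\mathcal{F}} \sum_{t=1}^T |y_t - f(x_t)| \right].
\]

As before, this expression is equal to

\begin{align*}
\mathbb{E} \sup_{f\in\mathcal{F}} \sum_{t=1}^T y_t f(x_t)
&= \mathbb{E}_{x_1\sim p_1} \mathbb{E}_{y_1} \cdots \mathbb{E}_{x_T\sim p_T(\cdot\mid x_{1:T-1})} \mathbb{E}_{y_T} \sup_{f\in\mathcal{F}} \sum_{t=1}^T y_t f(x_t) \\
&= \mathbb{E}_{(\mathbf{x},\mathbf{x}')\sim\boldsymbol{\rho}} \mathbb{E}_{\epsilon} \left[ \sup_{f\in\mathcal{F}} \sum_{t=1}^T \epsilon_t f(\mathbf{x}_t(-\mathbf{1})) \right].
\end{align*}

□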
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